PRACTICAL TRANSLATORS FOR LR(k) LANGUAGES

Abstract 

A context-free syntactical translator (CFST) is a machine which 
defines a translation from one context-free language to another. A 
transduction grammar is a formal system based on a context-free 
grammar and it specifies a context-free syntactical translation. A 
simple suffix transduction grammar based on a context-free grammar 
which is LR(k) specifies a translation which can be defined by a 
deterministic push-down automaton (DPDA).

A method is presented for automatically constructing CFSTs (DPDAs) 
from those simple suffix transduction grammars which are based on the 
LR(k) grammars. The method is developed by first considering gram- 
matical analysis from the string-manipulation viewpoint, then converting 
the resulting string-manipulation algorithms to DPDAs, and finally 
considering translation from the automata-theoretic viewpoint. 

The results are relevant to the automatic construction of compilers 
from formal specifications of programming languages. If the specifi- 
cations are, at least in part, based on LR(k) grammars, then corres- 
ponding compilers can be constructed which are, in part, based on 
CFSTs.



*This report reproduces a thesis of the same title submitted 
to the Electrical Engineering Department, Massachusetts 
Institute of Technology, in partial fulfillment of the re- 
quirements for the degree of Doctor of Philosophy. 

Chapter 1
INTRODUCTION 

1.1 Subject

The general subject of interest in this dissertation is "programming 
linguistics", which we consider to be a science concerning the design and 
specification of programming languages and the translation and subsequent 
evaluation and execution of programs. In particular, we are primarily 
interested in the problem of automatically generating translators from 
formal specifications of translations based on context-free (CF) grammars. 

1.2 Languages, Translations

In the sequel we use the two words, language and translation (also 
translator), in both the formal and informal sense. The proper sense in 
each case is always clear from context. A language is defined formally in 
Chapter 2 to be a set of strings. However, when we say "programming 
language" or "language designer", we have in mind a more intuitive notion. 
For instance, when we refer to the "language" ALGOL 60, we mean the 
syntax and semantics, the set of strings and their meanings, the lexicon 
and the grammar, operator precedences and associativities, scopes of 
variables, etc. Similarly, our formal definition in Chapter 5 of translations 
limits them to mappings from one set of strings to another, but we also use 
the term to mean a mapping from one set of things, of any sort, to another, 
of any sort. 

1.3 Viewpoint: TWSs, Modular Compilers

For purposes of discussion we picture ourselves throughout the 
dissertation as a subcontractor to a language designer. The designer has 
a contract to design and implement a practical algorithmic programming 
language, and he has subcontracted to us the task of implementing a com- 
piler for his language. 

We desire to automate our implementation procedures for three 
reasons: (1) the designer is likely to want to experiment to some extent 
to determine the effect of various design decisions, and he would like fairly
short response time, (2) we expect to receive more such contracts in the
future, and (3) implementing a compiler usually requires many man-hours 
of expensive programmer time. The embodiment of such an automation is 
called a translator writing system (TWS) (see the survey (F&G 68)). It is 
a system which takes as input the specification of the syntax and translation 
of a language and which produces as output a compiler for that language. 

The questions, then, which confront us are: how do we specify 
programming languages and their translations, and how can we map these 
specifications into compilers? We choose a modular approach which is 
a combination of some of the notions of Cheatham (Che 67) and Landin 
(Lan 66). We find it convenient, even natural, to section our specifications 
into components. For instance, we might specify separately the lexicon, the
context-free syntax, and the context-sensitive syntax. (We discuss briefly
in Section 1.4 and extensively in Chapter 5 our reasons for sectioning the
specifications in certain ways.) Further, we find it convenient to base
some aspects of our translation specification on these different components. It
is reasonable, then, to view a compiler, conceptually at least, as a con-
catenation of several corresponding subtranslators; i.e., as modularized.

The adoption of this viewpoint results in three significant advantages 
relative to a less modular approach. First, the otherwise complex task of 
compiling is viewed as broken into several relatively simple components, 
each of which may be analyzed virtually independently of the others. Second, 
the task of a TWS is viewed as the separate generation of several subtrans- 
lators, followed by their optimal combination to form a compiler. Third, 
because the specifications of some of the subtranslations can be naturally 
and conveniently based on formal grammars, the abundant results of both 
formal-grammar theory and automata theory are relevant to the corres-
ponding translators and their automatic generation. We consider the
theoretical underpinnings which accrue from the latter to be important 
because (1) they allow us to make provable statements regarding the efficiency, 
execution time, size, etc., of our translators, (2) they allow us to modify
our translators in a rational way to get an optimal compromise between 
time and space, (3) they help us avoid ad hoc, ill-understood modifications 
which make the subsequent combination of translators difficult, if not 
impossible or incorrect, and (4) they add a certain degree of "cleanliness" 
to our results. 

A possible criticism of our approach is that the separate analyses of
the components may result in translation methods or devices which, when 
combined, will form a compiler with gross redundancies, such as repeated 
building and scanning of data structures, which cannot be eliminated by any 
reasonably simple procedure. We do not believe that this will be the case, 
but we shall not go so far as to make this belief a thesis to be proved here. 
The results of ourselves and others are, however, steps in that direction. 

One existing result in this vein is presented in (Joh 68). It is a 
method of automatically generating practical "lexical analyzers", really
"lexical translators", from a specification based on regular expressions.
The technique is based directly on some rudimentary notions of finite-state
machine theory. It is our desire to get similar results for "CF syntax 
analyzers", really "CF syntactical translators (CFSTs)". 

1.4 The Role of CF Grammars 

Another belief which is fundamental to our work is that CF grammars 
can be used in a natural and convenient way as bases for the specifications 
of significant portions of the syntax and translation of programming languages, 
and we believe that this includes useful languages in which highly readable 
programs can be written. Furthermore, we find that a well designed CF 
grammar makes a concise, readable, and useful syntactical reference for a
language, a reference from which operator precedences and associativities, 
scopes of definitions, and other such "structural properties", can be quickly
and easily determined. 

Having stated our view in positive terms, we now add some disclaimers. 
(1) We do not contend that it is obvious how to design CF grammars so that 
they exhibit the above stated properties. For instance, we do not think that 
the (probably) best known CF grammar, that of ALGOL 60 (Nau 63), is an 
example of a good syntactical reference; it seems more complex than it 
needs to be. However, we illustrate in Chapter 7 a grammar which partially 
specifies a language comparable to ALGOL in many respects, and which, we 
think, is a reference with the desired properties. Unfortunately, the value 
of our results is somewhat limited until this grammar design problem is 
better understood. We have pursued our research, then, on the hope that 
some results relating to this problem are forthcoming. (2) We do not contend 
that programming languages should be CF. We merely believe that much of 
their syntax can be easily defined via CF grammars and that the remaining 
syntax, e.g., "context-sensitive features", can then be defined in other ways,
probably related to the CF grammars. See for example (Knu 66). (3) Neither
do we contend that CF grammars are a panacea with respect to language
specification. Indeed, they are woefully inadequate for indicating nonasso- 
ciative operators, for instance; and there are certainly other ways (see Chapter 
8) in which their usefulness would be enhanced if they could be extended. We 
merely believe that they are the most useful devices currently available for
specifying many of the "structural properties" of languages. 

LR(k) grammars. Actually, we do not intend to cover all of the CF
grammars here. Our experience is that, if a designer sets out to design
an unambiguous CF grammar to specify the "structural properties" of a
language, his result will be an LR(k) grammar (Knu 65); i.e., a grammar
whose sentences can be analyzed (parsed) during a single, deterministic scan 
from left to right. Intuitively, we feel that this situation obtains because the 
language is presumably designed to be written and read by humans, and humans, 
at least those who are used to reading natural languages from left to right, 
would probably find programs quite unreadable if they could not be syntactically 
analyzed during a single scan from left to right. 

Thus, to the extent that unambiguity is a desirable characteristic of 
a syntactical reference, anyway, our results should be as useful as if they 
covered all CF grammars. We do not find the restriction to unambiguity 
bothersome. 

The reason we choose the LR(k) grammars, in particular, is that they 
form the largest set of CF grammars whose sentences can be analyzed quickly 
by a deterministic, left-to-right automaton, as we show. We can therefore 
automatically generate at least part of a compiler for any language whose 
specification is, in part, based on an LR(k) grammar, and we can expect that 
part of the compiler to be fast. 




Translators. Finally, we emphasize that we are really interested in 
translators rather than just parsers, for reasons which we discuss exten- 
sively in Chapter 6. As a method for specifying CF syntactical translations 
we have chosen the "transduction grammars" of Lewis and Stearns (L&S 68). 
In fact, we use only the "simple suffix" transduction grammars (SSTGs) 
(see Chapter 6). Again our choice was based on the fact that the method seems
both natural and convenient for our purposes and on the fact that it has strong
ties with automata theory. 

1.5 Thesis

It is our thesis that by applying some rudimentary notions of 
automata theory we can develop a practical method of automatically generating 
CFSTs from those SSTGs which are based on the LR(k) grammars. Further-
more, if the SSTGs in question are used to specify the CF syntactical
translations of useful, readable programming languages, the resulting 
CFSTs will be of practical size and speed. 

By a "practical" method or CFST we mean one which is competitive 
with the methods or "recognizers" of section II.B of (F&G 68); i.e., ones
which have actually been used in the construction of compilers. Our aim is 
not so much to improve on the size and speed of CFSTs as it is to provide the 
language designer with flexibility. With existing methods the designer usually 
has to modify his grammar substantially before it is acceptable to the method. 
By covering all the LR(k) grammars we, hopefully, get a method which will accept 
grammars as they are designed as syntactical references for languages, with
no modifications. If the grammars are unambiguous and if all their sentences 
can be parsed deterministically, during a single scan from left to right, the 
latter will be true. 

1.6 Approach

Our approach to this problem is basically inspired by and quite 
similar to Knuth's. However, we draw even more heavily on automata
theory than he did, at least with respect to getting practical results, and we
treat translation rather than just parsing. We treat parsers first because 
they provide a convenient basis from which we can develop translators. This 
follows from the fact that the specifications of our translations are based on 
CF grammars. 

We begin in Chapter 2 by discussing parsing from the string-manipulation 
viewpoint, as is typical when working with formal grammars. We present 
a particular parser, described as a string-manipulation algorithm, and 
motivate our own definition of the LR(k) grammars. 

In Chapter 3 we develop a foundation by treating only the LR(0) 
grammars. We draw on finite-state machine (FSM) theory to develop a
machine for making basic string-manipulation (parsing) decisions. Then we 
shift entirely to automata theory by deriving from our string-manipulation 
algorithm, plus FSM, a deterministic push-down automaton (DPDA). That 
is, we get DPDAs as parsers for LR(0) grammars.

In Chapter 4 we find that a large and useful subset of the LR(k)
grammars, which we call the "Simple LR(k)" grammars, can be covered 
by first constructing an FSM as in the case of an LR(0) grammar, then 
adding to the machine some "look-ahead" information computed in a simple 
way, and finally converting the string-manipulation algorithm and FSM 
to a DPDA with "look-ahead". 

We generalize to cover all LR(k) grammars in Chapter 5. We find 
that parsers for some of these grammars can be constructed just as are 
those for Simple LR(k) grammars, if more complex methods for computing 
"look-ahead" information are employed. In general, however, we find 
that some state-splitting operations must be applied to the FSMs along with
the more complex computations of "look-ahead". Our development in 
Chapter 5 is in two phases. We first cover a set of grammars of the 
"bounded context" variety and then we generalize to cover all LR(k) grammars. 

Our result going into Chapter 6, then, is a parser-constructing 
technique which grows in complexity as it discovers the complexity of the 
grammar at hand. 

In Chapter 6 we motivate the abstraction of a string-to-string
translation from the compilation process. Then we define transduction grammars 
for use in specifying these translations and show how to convert our parsers 
to translators. Finally, we show how we envision our translators fitting into 
compilers, via an explicit model, and we discuss the relevance of our results 
to the design and specification of languages, translations and compilers. 

We illustrate in Chapter 7 the practicability of our scheme. We
first summarize our translator constructing technique as a whole. Then 
we propose a method of implementing the translators, apply the method 
to a particular, practical transduction grammar, and show that our scheme 
compares favorably with an existing, practical technique. 

We end the dissertation with Chapter 8 in which we note some 
developments which are desirable before our scheme is incorporated in a 
TWS, state some conclusions, and pose some questions for future research.

1.7 Efficiency, Complexity, Recognizers

Several more informal definitions are in order before we proceed. 

In the sequel we frequently refer to the "efficiency" of our translators. 
By "time-efficiency" we mean the ability to effect a translation using a 
minimum number of "machine operations", and therefore time. In Chapter 
4 we give a specific definition in terms of an ideal machine. By "space- 
efficiency" we mean the ratio of the amount of space necessary to store 
the specification of a translation to that necessary to store the corresponding 
translator. We define this more precisely in Chapter 7. 

The "size" of a grammar is the number of symbols required to write 
down all the left and right parts of the productions. By "grammatical 
complexity" we mean a measure of the time required to construct a parser 
for a grammar when using our technique. Although this definition may seem 
to reflect some egotism, we use it for lack of a better choice. It does,
however, seem to correspond to the intuitive notion fairly well. Our 
measure depends both on the size of the grammar and on the "complexity" 
of the functions which must be employed to compute "look-ahead" and 
state-splitting.

Finally, we use the word "recognizer" in a more technical sense
than it was used in (F&G 68). We adopt the automata-theoretic notion that
a recognizer is a machine which reads a string and either accepts or rejects 
it, as far as its being in a given language is concerned. Our parsers and 
translators output considerably more information than is contained in a 
simple "yes" or "no" from a recognizer. 

Chapter 2
PRELIMINARIES 

2.1 Notation, Preliminary Definitions

We begin by defining terms and notation. We assume the reader 
is familiar with the properties of symbols, strings of symbols, regular 
expressions, and languages, finite state machines (FSMs), formal grammars,
and both deterministic and nondeterministic pushdown automata (DPDAs and 
NPDAs). 

A context-free (CF) grammar is a quadruple $(V_T, V_N, S, P)$ where
$V_T$ is a finite set of symbols called terminals, $V_N$ is a finite set of symbols
distinct from those in $V_T$ called nonterminals, $S$ is a distinguished member of
$V_N$ called the starting symbol, and $P$ is a finite set of pairs called productions.
Each production is written $A \to \omega$ and has a left part $A$ in $V_N$ and right part
$\omega$ in $V^{*}$ where $V = V_N \cup V_T$. $V^{*}$ denotes the set of all strings composed of
symbols in $V$, including the empty string.

Without loss of generality we adopt the conventions that (i) the productions
are arbitrarily numbered from 0 to $s$, and (ii) the zeroth production is of
the form $S \to \vdash S' \dashv$, where $S'$ is a sort of subordinate starting symbol and $S$
and the terminal "pad" symbols $\vdash$ and $\dashv$ appear in none of the other productions.

We use Latin capitals to denote nonterminals, lower case Latin letters,
digits and special symbols (e.g., +, *, :, etc.) to denote terminals, and
lower case Greek letters to denote strings. An exception is that we reserve
$\varepsilon$ to denote the empty string. We use $|\beta|$ to denote the length of (number
of symbols in) the string $\beta$, and $k{:}\beta$ to denote the first $k$ symbols of $\beta$ if
$|\beta| > k$ and $\beta$ otherwise. If $\alpha = \varphi\beta$ is a string, then $\varphi$ is a prefix and $\beta$
is a suffix of $\alpha$, and $\alpha$ is the concatenation of $\varphi$ and $\beta$.
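
The string notation above can be made concrete with a small sketch. It assumes, purely for illustration, that a string is modeled as a Python tuple of symbols; the function names `k_prefix`, `is_prefix`, and `is_suffix` are ours and do not appear in the text.

```python
def k_prefix(k, beta):
    """Return k:beta, the first k symbols of beta, or beta itself if it is shorter."""
    return beta[:k]


def is_prefix(phi, alpha):
    """True if alpha can be written as phi followed by some string."""
    return alpha[:len(phi)] == phi


def is_suffix(beta, alpha):
    """True if alpha can be written as some string followed by beta."""
    return len(beta) <= len(alpha) and alpha[len(alpha) - len(beta):] == beta


# The empty tuple plays the role of the empty string.
EMPTY = ()
```

With this representation, concatenation is tuple addition, so `phi + beta` models the string written as the two names side by side.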

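In the sequel, we often use for examples the grammar

$$G_1 = (\{(,\ ),\ i,\ \uparrow,\ +,\ \vdash,\ \dashv\},\ \{S, E, T, P\},\ S,\ P_1)$$

where $P_1$ consists of the following productions:

$$\begin{array}{llll}
(0) & S \to \vdash E \dashv & (4) & T \to P \\
(1) & E \to E + T & (5) & P \to i \\
(2) & E \to T & (6) & P \to ( E ) \\
(3) & T \to P \uparrow T & &
\end{array}$$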
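
As an illustration, the example grammar can be written down as plain data and checked against the conventions stated above. This is only a sketch: the ASCII spellings `"|-"`, `"-|"`, and `"^"` for the pad symbols and the up-arrow, and the helper name `check_grammar`, are assumptions made here.

```python
# Grammar G1 as plain Python data (ASCII stand-ins are our assumption).
TERMINALS = {"(", ")", "i", "^", "+", "|-", "-|"}
NONTERMINALS = {"S", "E", "T", "P"}
START = "S"

# PRODUCTIONS[p] = (left part, right part); the index p is the production number.
PRODUCTIONS = [
    ("S", ("|-", "E", "-|")),  # (0)
    ("E", ("E", "+", "T")),    # (1)
    ("E", ("T",)),             # (2)
    ("T", ("P", "^", "T")),    # (3)
    ("T", ("P",)),             # (4)
    ("P", ("i",)),             # (5)
    ("P", ("(", "E", ")")),    # (6)
]


def check_grammar(terminals, nonterminals, start, productions):
    """Return True if the data satisfies the conventions stated in the text."""
    vocabulary = terminals | nonterminals
    if terminals & nonterminals or start not in nonterminals:
        return False
    for left, right in productions:
        if left not in nonterminals or any(x not in vocabulary for x in right):
            return False
    # Convention (ii): production 0 has the form S -> |- S' -| ...
    left0, right0 = productions[0]
    if left0 != start or len(right0) != 3:
        return False
    if right0[0] != "|-" or right0[2] != "-|":
        return False
    # ... and S and the pad symbols occur in no other production.
    reserved = {start, "|-", "-|"}
    return all(left not in reserved and not (reserved & set(right))
               for left, right in productions[1:])
```

Keeping the production number equal to the list index makes it easy to emit a parse as a sequence of integers.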

If $A \to \omega$ is a production, an immediate derivation of one string
$\alpha = \rho\omega\beta$ from another $\alpha' = \rho A \beta$ is written $\alpha' \to \alpha$. We say $\alpha$ is immediately
derivable from $\alpha'$ via application of the production $A \to \omega$ to a particular
occurrence of $A$ in $\alpha'$. The transitive completion of this relation is a
derivation and is written $\alpha' \to^{*} \alpha$, which means there exist strings $\alpha_0, \alpha_1, \ldots, \alpha_n$
such that $\alpha' = \alpha_0 \to \alpha_1 \to \cdots \to \alpha_n = \alpha$ for $n > 0$. A right derivation, written
$\alpha' \to_R^{*} \alpha$, is one in which for $i = 1, 2, \ldots, n$ each $\alpha_i$ is immediately derivable from
$\alpha_{i-1}$ via application of a production to the rightmost nonterminal in $\alpha_{i-1}$.
We choose the right derivation as our canonical derivation.

A terminal string is one consisting entirely of terminals. A
sentential form is any string derivable from $S$. A sentence is any terminal
sentential form. The language $L(G)$ generated by $G$ is the set of sentences;
i.e., $L(G) = \{\eta \in V_T^{*} \mid S \to^{*} \eta\}$. A right sentential form, which we choose as
our canonical form, is any string canonically derivable from $S$.

An example of a canonical derivation of a string $\eta_1$ in $L(G_1)$ follows,
where in each canonical form we underline the rightmost nonterminal and
indicate the production used to derive the next form.



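$$\begin{array}{lcl}
\text{Canonical Form} & & \text{Production} \\
\underline{S} & (0) & S \to \vdash E \dashv \\
\vdash \underline{E} \dashv & (1) & E \to E + T \\
\vdash E + \underline{T} \dashv & (4) & T \to P \\
\vdash E + \underline{P} \dashv & (5) & P \to i \\
\vdash \underline{E} + i \dashv & (2) & E \to T \\
\vdash \underline{T} + i \dashv & (3) & T \to P \uparrow T \\
\vdash P \uparrow \underline{T} + i \dashv & (4) & T \to P \\
\vdash P \uparrow \underline{P} + i \dashv & (5) & P \to i \\
\vdash \underline{P} \uparrow i + i \dashv & (5) & P \to i \\
\vdash i \uparrow i + i \dashv \;=\; \eta_1 & &
\end{array}$$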

Note that a canonical derivation is a strictly right-to-left process since we
always replace the rightmost nonterminal. 
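
The derivation above can be replayed mechanically. The following is a minimal sketch, assuming the example grammar is encoded with the ASCII stand-ins `"|-"`, `"-|"`, and `"^"`, and that strings are tuples of symbols; the function names are ours.

```python
# Grammar G1 (ASCII stand-ins for the pad symbols and the up-arrow).
NONTERMINALS = {"S", "E", "T", "P"}
PRODUCTIONS = [
    ("S", ("|-", "E", "-|")),  # (0)
    ("E", ("E", "+", "T")),    # (1)
    ("E", ("T",)),             # (2)
    ("T", ("P", "^", "T")),    # (3)
    ("T", ("P",)),             # (4)
    ("P", ("i",)),             # (5)
    ("P", ("(", "E", ")")),    # (6)
]


def rightmost_step(form, p):
    """Apply production p to the rightmost nonterminal of form."""
    left, right = PRODUCTIONS[p]
    for pos in range(len(form) - 1, -1, -1):
        if form[pos] in NONTERMINALS:
            if form[pos] != left:
                raise ValueError(f"production {p} does not rewrite {form[pos]}")
            return form[:pos] + right + form[pos + 1:]
    raise ValueError("form contains no nonterminal")


def canonical_derivation(numbers, start=("S",)):
    """Return the list of canonical forms obtained by applying the productions in order."""
    forms = [start]
    for p in numbers:
        forms.append(rightmost_step(forms[-1], p))
    return forms


derivation = [0, 1, 4, 5, 2, 3, 4, 5, 5]
forms = canonical_derivation(derivation)
for form in forms:
    print(" ".join(form))
```

Reversing the list `derivation` gives the sequence of production numbers that a canonical parser would have to output for the final string.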

We assume that grammar $G$ has no useless productions; i.e., we
assume that for each production $A \to \omega$ there exists a derivation
$S \to^{*} \alpha\beta\delta$ in which that production is applied, where $\alpha$, $\beta$, and $\delta$ are
terminal strings. Presumably our language
designer has made an error if there are useless productions in the grammar.
Fortunately, well known methods exist for detecting such errors (see (Gin 66),
section 1.4).
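
One common way to detect useless productions is a two-pass marking procedure: first find the nonterminals that derive some terminal string, then find those reachable from the starting symbol. The sketch below illustrates that general idea; it is not claimed to be the method of (Gin 66), and the function name and data layout are assumptions.

```python
def useless_productions(productions, nonterminals, start):
    """Return the numbers of productions that occur in no derivation of a sentence."""

    def right_is_productive(right, productive):
        return all(x not in nonterminals or x in productive for x in right)

    # Pass 1: nonterminals that derive at least one terminal string.
    productive = set()
    changed = True
    while changed:
        changed = False
        for left, right in productions:
            if left not in productive and right_is_productive(right, productive):
                productive.add(left)
                changed = True

    # Pass 2: nonterminals reachable from start through productive productions.
    reachable = {start} if start in productive else set()
    changed = True
    while changed:
        changed = False
        for left, right in productions:
            if left in reachable and right_is_productive(right, productive):
                for x in right:
                    if x in nonterminals and x not in reachable:
                        reachable.add(x)
                        changed = True

    return [p for p, (left, right) in enumerate(productions)
            if left not in reachable or not right_is_productive(right, productive)]


# Grammar G1 in ASCII, plus two deliberately useless productions for illustration.
G1 = [
    ("S", ("|-", "E", "-|")), ("E", ("E", "+", "T")), ("E", ("T",)),
    ("T", ("P", "^", "T")), ("T", ("P",)), ("P", ("i",)), ("P", ("(", "E", ")")),
]
BROKEN = G1 + [("T", ("Q",)), ("R", ("i",))]  # Q derives nothing; R is unreachable
```

Here `useless_productions(BROKEN, {"S", "E", "T", "P", "Q", "R"}, "S")` flags the two added productions, while the unmodified grammar yields an empty list.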

Loosely speaking, a parse of a string is some indication of how that
string was derived. In particular, a canonical parse of a sentential form
$\alpha$ is the reverse of the sequence of productions (or equivalently, the numbers
thereof) used in a canonical derivation of $\alpha$. We refer to the action of
determining a parse as parsing; the determination constitutes a grammatical
analysis, and a parsing algorithm is called a parser.

Being interested for the present in grammatical analysis, we view a
grammar $G$ as serving two purposes: (i) it is a set of rules for generating
the sentences in $L(G)$, and (ii) it defines the input/output relations of any
corresponding canonical parser; i.e., if the input to the parser is a string
$\eta$ in $L(G)$, the output should be a canonical parse of $\eta$. However, because
the latter is ill defined in the case that $\eta$ has several canonical parses and
because we desire ultimately to generate a unique translation of $\eta$ from a
unique canonical parse, we are led to the following definition. A grammar
$G$ is unambiguous if and only if each canonical form, and therefore each
sentence, has a unique canonical parse. It follows immediately that each
canonical form of an unambiguous grammar has a unique canonical derivation.

2.2 Characteristic Strings

We would like to describe a particular canonical parser, but first 
we define some strings which together provide a useful characterization 
of the decisions which must be made while parsing. 

Definition 2.1. Let $G$ be a CF grammar with $s + 1$ productions.
Let $\{\#_0, \#_1, \ldots, \#_s\}$ be a set of special symbols not in $V$, called
#-symbols, such that $\#_0$ is associated with production 0, $\#_1$
with 1, ..., and $\#_s$ with $s$. Let the $p$-th production be $A \to \omega$,
and let $\alpha' = \rho A \beta$ and $\alpha = \rho\omega\beta$ be canonical forms such that there
exists a canonical derivation $S \to_R^{*} \alpha' \to_R \alpha$. Then $\rho\omega\#_p$ is a
characteristic string of $\alpha$. We call $\rho\omega$ the stack string of
$\rho\omega\#_p$ and a stack string of $\alpha$, and we call $\beta$ an input string
of $\alpha$.

A characteristic string of $\alpha$ is, in essence, a summary of information about
$\alpha$ useful for canonical parsing. It indicates that there exists a canonical
derivation of $\alpha$ in which it is immediately preceded by another form $\alpha'$
which can be formed as follows: remove from the end of the stack string
$\rho\omega$ the substring $\omega$ which matches the right part of production $p$,
replace $\omega$ with the left part $A$, and concatenate the result with the input
string $\beta$. We describe this procedure as "making a reduction" via
application of the "applicable production" to the end of the stack string.
In concert with this terminology we often refer to productions as
reductions, visualizing them written $\omega \to A$.

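As examples we give several canonical forms of grammar $G_1$
with corresponding characteristic strings:

$$\begin{array}{ll}
\eta_1 = \vdash i \uparrow i + i \dashv & \vdash i\, \#_5 \\
\vdash P \uparrow i + i \dashv & \vdash P \uparrow i\, \#_5 \\
\vdash P \uparrow P + i \dashv & \vdash P \uparrow P\, \#_4
\end{array}$$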
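
Definition 2.1 can be exercised directly: replaying a canonical derivation tells us, for each form derived, where the right part of the applied production ends, and hence what the stack string is. The sketch below assumes the ASCII encoding `"|-"`, `"-|"`, `"^"` of the example grammar and writes the #-symbol for production p as `#p`; these spellings and the function names are ours.

```python
# Grammar G1 (ASCII stand-ins for the pad symbols and the up-arrow).
NONTERMINALS = {"S", "E", "T", "P"}
PRODUCTIONS = [
    ("S", ("|-", "E", "-|")), ("E", ("E", "+", "T")), ("E", ("T",)),
    ("T", ("P", "^", "T")), ("T", ("P",)), ("P", ("i",)), ("P", ("(", "E", ")")),
]


def characteristic_strings(numbers, start=("S",)):
    """Replay a canonical derivation given as production numbers.

    Returns one (form, stack string, production number) triple per derived form.
    """
    result = []
    form = start
    for p in numbers:
        left, right = PRODUCTIONS[p]
        # Position of the rightmost nonterminal; max() raises ValueError if none.
        pos = max(i for i, x in enumerate(form) if x in NONTERMINALS)
        if form[pos] != left:
            raise ValueError(f"production {p} does not rewrite {form[pos]}")
        form = form[:pos] + right + form[pos + 1:]
        # The stack string ends where the right part just inserted ends.
        result.append((form, form[:pos + len(right)], p))
    return result


def show(stack, p):
    """Format a characteristic string as text."""
    return " ".join(stack) + f" #{p}"


triples = characteristic_strings([0, 1, 4, 5, 2, 3, 4, 5, 5])
for form, stack, p in reversed(triples):
    print(" ".join(form).ljust(20), show(stack, p))
```

Printed in reverse order, the triples list the characteristic strings in the order in which a left-to-right canonical parser would need them.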

Theorem 2.1: A CF grammar $G$ is unambiguous if and
only if each canonical form $\alpha$ of $G$, except $S$, has a
unique characteristic string.

Proof: We exclude $\alpha = S$ because we defined no
characteristic string for it. Clearly $S$ has a unique
canonical derivation so the exclusion does not affect
the following.

if part: To prove $G$ is unambiguous we must show that
every canonical form has a unique canonical parse. We
proceed by induction, letting $P_n$ be the proposition that
every canonical form derived in $n$ steps has a unique
canonical parse. $P_1$ is true, because there is only
one derivation consisting of one step, namely
$S \to \vdash S' \dashv$. For some $n > 0$ we assume that $P_n$
is true and prove $P_{n+1}$. Consider a form $\alpha$ derived
in $n+1$ steps and having a unique characteristic string.
Every canonical derivation of $\alpha$ must end in the same
step $\alpha' \to \alpha$, for some $\alpha'$ derivable in $n$ steps, by
definition of characteristic strings. Thus, any canonical
parse of $\alpha$ must be the production applied in $\alpha' \to \alpha$,
followed by some canonical parse of $\alpha'$. But $\alpha'$ has
only one such parse, by the inductive hypothesis, so
$\alpha$ has only one canonical parse. Thus, $G$ is unambiguous
by definition.

only if part: If $G$ is unambiguous, each such $\alpha$ has a
unique canonical parse and derivation. Therefore by
definition it can have only one characteristic string. Q.E.D.

2.3 A Canonical Parser

Our canonical parser is described simply as follows. Commencing
with string $\eta$ in $L(G)$, iteratively (i) determine a characteristic string of
the current canonical form, (ii) output the production indicated by the last
symbol of that characteristic string, (iii) make the corresponding reduction,
and (iv) stop when the new canonical form is $\alpha = S$.

Several comments are in order with regard to this algorithm. First, 
it is incomplete since we have not stated how to determine characteristic 
strings. We investigate this problem thoroughly in Chapters 3, 4, and 5, 
but we solve it there only for a restricted class of CF grammars which we 
are about to define. Second, since these special grammars are all unambigu- 
ous, we can change part (i) to read "determine the characteristic string ..." 
Thus, the algorithm is well defined, and deterministic, for the grammars 
of interest. Third, since each iteration is the reverse of a step in a 
canonical derivation, it is clear that the process as a whole is just the reverse 
of a canonical derivation. Thus, the parser proceeds strictly from left to 
right, except perhaps for the computation required to determine characteristic 
strings. This is, of course, precisely why we are interested in this particular 
parser. 

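A determination of the canonical parse (5, 5, 4, 3, 2, 5, 4, 1, 0) of the
string $\eta_1$ derived above is exemplified below, where we underline the
reducible substring in each canonical form and characteristic string.

$$\begin{array}{llc}
\text{Canonical Form} & \text{Characteristic String} & \text{Output} \\
\vdash \underline{i} \uparrow i + i \dashv & \vdash \underline{i}\, \#_5 & 5 \\
\vdash P \uparrow \underline{i} + i \dashv & \vdash P \uparrow \underline{i}\, \#_5 & 5 \\
\vdash P \uparrow \underline{P} + i \dashv & \vdash P \uparrow \underline{P}\, \#_4 & 4 \\
\vdash \underline{P \uparrow T} + i \dashv & \vdash \underline{P \uparrow T}\, \#_3 & 3 \\
\vdash \underline{T} + i \dashv & \vdash \underline{T}\, \#_2 & 2 \\
\vdash E + \underline{i} \dashv & \vdash E + \underline{i}\, \#_5 & 5 \\
\vdash E + \underline{P} \dashv & \vdash E + \underline{P}\, \#_4 & 4 \\
\vdash \underline{E + T} \dashv & \vdash \underline{E + T}\, \#_1 & 1 \\
\underline{\vdash E \dashv} & \underline{\vdash E \dashv}\, \#_0 & 0
\end{array}$$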

We now informally prove that our canonical parser operates as
desired for the purposes of compiling; i.e., that when it is applied to a string
$\eta$ in $L(G)$ it outputs the canonical parse of $\eta$ and stops, and that when it is
applied to a string $\eta'$ not in $L(G)$ it aborts somehow after a finite time. The
former follows from the fact that the parser executes the reverse of a canonical
derivation. The latter depends on the fact that no canonical derivation exists for
any such string $\eta'$, and on the following two assumptions. First, we assume
that there is an auxiliary mechanism, a "loader" program, say, which checks
all strings presented to the parser and ensures that the first and last symbols
are $\vdash$ and $\dashv$, respectively. Second, we assume that whatever device is used
to determine characteristic strings never looks to the left of $\vdash$ or to the right
of $\dashv$. The way the parser must abort, then, is by determining that there is
no characteristic string for the string from $\vdash$ to $\dashv$, inclusive. It is a finite
task to determine that $\eta'$ of length $n$ has no characteristic string because, if
nothing else, we could simply generate all strings of length $n$ using $G$ and
determine that $\eta'$ is not one of these strings. Of course, the exact way in
which our parser aborts will not become clear until we develop a device for
determining characteristic strings.
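
The "if nothing else" argument can be turned into a deliberately naive, runnable parser: enumerate every canonical form no longer than the input, record its characteristic strings, and then iterate the four steps of the canonical parser. This is a brute-force sketch for illustration only, not the practical device developed in later chapters. It assumes the ASCII encoding of the example grammar and relies on the grammar having no empty right parts and no cycles of single-symbol productions, so that forms never shrink and the enumeration terminates.

```python
from collections import deque

# Grammar G1 (ASCII stand-ins for the pad symbols and the up-arrow).
NONTERMINALS = {"S", "E", "T", "P"}
PRODUCTIONS = [
    ("S", ("|-", "E", "-|")), ("E", ("E", "+", "T")), ("E", ("T",)),
    ("T", ("P", "^", "T")), ("T", ("P",)), ("P", ("i",)), ("P", ("(", "E", ")")),
]


def characteristic_table(max_len):
    """Map each canonical form of length <= max_len to its set of
    (stack-string length, production number) pairs."""
    table, seen, queue = {}, {("S",)}, deque([("S",)])
    while queue:
        form = queue.popleft()
        positions = [i for i, x in enumerate(form) if x in NONTERMINALS]
        if not positions:
            continue  # a sentence: nothing left to rewrite
        pos = positions[-1]  # rightmost nonterminal
        for p, (left, right) in enumerate(PRODUCTIONS):
            if left != form[pos]:
                continue
            new = form[:pos] + right + form[pos + 1:]
            if len(new) > max_len:
                continue  # forms never shrink, so this branch is hopeless
            table.setdefault(new, set()).add((pos + len(right), p))
            if new not in seen:
                seen.add(new)
                queue.append(new)
    return table


def canonical_parse(sentence):
    """Return the canonical parse, or raise ValueError to 'abort'."""
    form = tuple(sentence)
    table = characteristic_table(len(form))
    output = []
    while form != ("S",):
        entries = table.get(form, set())
        if len(entries) != 1:  # (i) no (unique) characteristic string
            raise ValueError("cannot parse: " + " ".join(form))
        stack_len, p = next(iter(entries))
        left, right = PRODUCTIONS[p]
        output.append(p)  # (ii) output the production
        # (iii) make the reduction at the end of the stack string
        form = form[:stack_len - len(right)] + (left,) + form[stack_len:]
    return output  # (iv) the form is now S


print(canonical_parse("|- i ^ i + i -|".split()))
```

The enumeration is exponential in general, which is precisely why a finite-state device for determining characteristic strings is worth developing.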

2.4 LR(k) Grammars

In hopes of being able to develop practical parsers for them, we now
restrict our attention to those CF grammars whose sentences can be parsed
deterministically during a single scan from left to right.

Definition 2.2. Let $k$ be a non-negative integer. A CF
grammar $G$ is LR(k) if and only if every canonical form
$\alpha = \varphi\beta$ of $G$, except $\alpha = S$, has a unique characteristic
string $\varphi\#_p$ which can be determined by investigating
only $\varphi$ and $k{:}\beta$.

The original definition of LR(k) grammars appeared in (Knu 65). A definition 
very like our own can be found in (H&U 69). 

Theorem 2.2. An LR(k) grammar is unambiguous.

Proof: The uniqueness of characteristic strings in conjunction
with Theorem 2.1 proves this. Q.E.D.

We have already seen that our canonical parser proceeds strictly 
from left to right as far as the making of reductions is concerned. The 
implication of our definition is that for LR(k) grammars the process for
determining characteristic strings need never get more than k symbols 
ahead of the reduction process. Further, if sufficient information can 
be remembered about the string already processed, no rescanning of that 
string is necessary and the parser as a whole may proceed from left to 
right, except when the process for determining characteristic strings 
peeks ahead as many as k symbols. We show in Chapters 3, 4, and 5, that 
sufficient information can be remembered via a finite number of machine 
states and a pushdown stack, and in fact, that our parser is equivalent to a 
DPDA. 

We emphasize that the LR(k) definition allows parsing decisions to
depend on arbitrarily large left context ($\varphi$) but only on finite right context
($k{:}\beta$). Thus it defines the largest possible set of grammars consistent with
our deterministic, left-to-right bent. This is because no additional information
about the parsing decisions which were made to reduce the left part of the
original string to $\varphi$ would be of any use in making new decisions, since we
are concerned only with context-free grammars. In other words, none of
the "substructure" associated with $\varphi$ is relevant to any future parsing
decisions.

As an example of an LR(0) grammar, consider $G_0$ whose productions
follow.

$$\begin{array}{llll}
(0) & S \to \vdash E \dashv & (4) & E \to b B \\
(1) & E \to a A & (5) & B \to c B \\
(2) & A \to c A & (6) & B \to d \\
(3) & A \to d & &
\end{array}$$

Because G_0 is small and simple it is easy to confirm that it is, indeed, LR(0).
The canonical forms of G_0 are indicated in the following two derivations,
where n ≥ 0:

S → ⊢ E ⊣ → ⊢ a A ⊣ → ... → ⊢ a c^n A ⊣ → ⊢ a c^n d ⊣
S → ⊢ E ⊣ → ⊢ b B ⊣ → ... → ⊢ b c^n B ⊣ → ⊢ b c^n d ⊣

Since these represent all possible derivations, it is easy to see from
Definition 2.1 that the corresponding characteristic strings are unique, and
as follows:

⊢ E ⊣ #_0,  ⊢ a A #_1,  ...,  ⊢ a c^n A #_2,  ⊢ a c^n d #_3
⊢ E ⊣ #_0,  ⊢ b B #_4,  ...,  ⊢ b c^n B #_5,  ⊢ b c^n d #_6

Further, it can easily be determined by exhaustive testing that the
characteristic string φ#_p of each canonical form α = φβ can be determined without
regard to any right context (β). Thus, G_0 is LR(0). We shall prove this in
a more satisfying way in Chapter 3.

Now, because G_0 is LR(0), we can generate a parser for it via
the simplest version of our technique, as we shall see. However, of all
the parser-generating techniques discussed in (F&G 68), Knuth's is the
only one which covers G_0. This is because the most general of the other
techniques covers only the "bounded right context" grammars (Flo 64);
i.e., grammars whose sentences can be parsed from left to right with
no decisions depending on more than a bounded amount of left or right
context.

To see that G_0 is not bounded right context, consider the string
η = ⊢ a c^n d ⊣. To parse η the reduction which must be made first is
d → A. But that decision depends on the fact that there is an "a" arbitrarily
far to the left. Had the "a" been "b" instead, the applicable reduction would
have been d → B.

We illustrate that our previous example grammar G_1 is not
LR(0) by exhibiting two similar canonical forms of G_1 which have distinct
characteristic strings:

Canonical Forms    Characteristic Strings    Reductions

⊢ P + i ⊣          ⊢ P #_4                   (4) P → T
⊢ P ↑ i ⊣          ⊢ P ↑ i #_5               (5) i → P

Given the canonical form α = ⊢ P + i ⊣, we could not conclude on the basis
of the prefix "⊢ P" alone that the characteristic string of α is "⊢ P #_4".
We should have to look one symbol ahead to be sure α is not "⊢ P ↑ i ⊣"
or the like. Because elimination of such uncertainties as these can always
be effected by a look-ahead of one symbol, G_1 is an LR(1) grammar. We
prove this in Chapter 4.

Of course, our parser need not look ahead in unambiguous situations.
For instance, there is never any uncertainty about whether "i" should be
reduced to "P" for grammar G_1, no matter what the context. This fact
illustrates that the smallest k for which a grammar is LR(k) is limited by
the worst case of necessary look-ahead.

As examples of grammars which are not LR(k) for any k ≥ 0, we
could choose any ambiguous grammar. The violation of Theorem 2.2 is
immediate. Neither is the mirror image of grammar G_0 LR(k). This is
because the "d" would now appear on the left end of each sentence, and we
would need arbitrary right context to choose between the reductions d → A
and d → B.

This latter case suggests the concept of RL(k) grammars, whose
sentences can be parsed deterministically from right to left. We do not
pursue this concept further since the generalization is obvious.

2.5 The Meaning of the LR(k) Condition

We emphasize the fact that the LR(k) condition is one on the grammar,
not the language. For instance, the grammar S → ⊢ E ⊣, E → a E a, E → a
is not LR(k) for any k, but the language it generates, ⊢ a [aa]* ⊣, is regular,
and therefore recognizable by an FSM. The grammar not being LR(k)
corresponds to the fact that the strings cannot be parsed, according to this
grammar, by a DPDA. There
does, however, exist an LR(k) grammar which generates the same language.
In fact, Knuth has shown that there exists an LR(k) grammar for every
deterministic language; i.e., every language which can be recognized by a
DPDA has a grammar such that the sentences can be parsed by a DPDA.

The latter fact is only of somewhat academic interest from our point 
of view because we are ultimately interested in using grammars to specify 
translations from strings into structures, so we are as interested in the 
structural properties of grammars as we are in the languages they generate. 
The case just given is one where no LR(k) grammar exists which has the 
symmetrical structural property of the original grammar. This corresponds 
to the fact that no DPDA could determine the center of an arbitrarily long 
string without looking arbitrarily far ahead to find the end of the string. 

It is also of some academic interest that any "LR(k) language", i.e.,
one generated by an LR(k) grammar, can also be generated by an LR(0) grammar.†
This fact is another which is not very interesting from our viewpoint because
it has not been shown, and indeed, we suspect that it is not true, that an LR(0)
grammar exists which is "structurally equivalent" to the original grammar.
(See (Che 67) for a precise definition of the "structural equivalence" of
grammars.)

† In (Knu 65) the result is that there is an LR(1) grammar for each LR(k) language,
but this is because Knuth does not assume the left and right "pad" symbols to be
built into the grammar. One-symbol look-ahead is therefore necessary to detect
the end of the string.

2.6 Terminology in Automata Theory

The following is intended only as a review of terminology, since we 
assume that the reader is already familiar with the concepts. However, the 
reader should pay special attention to the discussion of DPDAs, because our 
representations of them are unusual. We first discuss a link between formal 
grammars and automata theory. 

A production is said to be right linear (Gin 66) if it is of the form
A → ωB or A → ω, where A and B are in V_N and ω is in V_T*. A CF grammar
is called right linear if all of its productions are right linear. A right linear
grammar G_R is said to generate a regular language, and it is well known
that the latter can be recognized by an FSM which can be derived from G_R
(H&U 69).

FSMs. Formally, an FSM (Hen 68) is an abstract model consisting
of a finite set of input symbols, a finite set of output symbols, a finite set
of states, a next-state function, and an output function. For our purposes
an FSM need only be a recognizer, so the output symbols need include only
"1" and "0", or "yes" and "no". We consider an FSM to be synonymous
with one of its representations, namely a "transition graph", and we discuss
FSMs in terms of the latter rather than in terms of the above five components.

A transition graph consists of a set of nodes with various arrows
drawn between them. Each node represents a state and is drawn as a box
enclosing N, where N is the name of the state (we use integers for state-names).
Each arrow is labeled with an input symbol s; it is said to be a transition
under s, or simply an s-transition, and it represents an element of the
next-state function. A starting state is indicated by a short incoming arrow which
originates on no node of the graph. A terminal state is indicated by a
distinctively drawn node.

[Figure: an example of an FSM (transition graph).]

A series of transitions leading through an FSM from state N_0 to
state N_1 ... to state N_n is called a path from N_0 to N_n. Every such path
spells out a unique string of input symbols (i.e., an input string) in the
obvious way. An FSM accepts a given string η if and only if there exists
at least one path that begins at a starting state, spells out η, and ends at
a terminal state. The set of all strings accepted by an FSM is referred to
as the set that is recognized by that FSM.

A state M is said to be accessible from state N if and only if there is
a path from N to M; the input string spelled out by such a path is said to
access M from N. When the initial state is not specified, it is understood
to be a starting state.

If we associate the output symbol "1" with each terminal state and
"0" with each of the others, each path also spells out a unique string of
output symbols (i.e., an output string). States M and N are said to be
equivalent if and only if for each input string η spelled out by some path
from M (N), such that the path also spells out the output string η',
there exists a path from N (M) which spells out the same two strings η and η',
respectively.

An FSM is said to be deterministic if and only if it has a single 
starting state and from each state there is at most one transition under each 
distinct input symbol; otherwise, it is said to be nondeterministic. A deter- 
ministic FSM is said to be reduced if and only if every state is accessible 
from the starting state, some terminal state is accessible from every state, 
and no two states are equivalent. A reduced machine is unique within the 
names of its states, and, since it is a homomorphic image of other machines 
which recognize the same set, it can in a real sense be thought of as minimal. 

We often think of a deterministic FSM as a physical machine, rather
than as an abstract model, and this leads to the following terminology. To
determine if a given FSM accepts a given string η, we say that we initialize
the machine (i.e., start it in its starting state), apply it to η, and determine
if η takes the machine through a sequence of states to a terminal state. The
machine is said to read the symbols in η from an input tape, to enter first one
state and then the next, and to output symbols onto an output tape. If after reading
the last symbol of η the machine outputs a "1", then it accepts η. However,
if at that time it outputs a "0" or if it stops reading before it reaches the
end of η, it does not accept η. The machine stops reading whenever it enters
a state with no transition under the next symbol to be read.
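
The terminology of initializing a deterministic FSM, applying it to a string, and stopping when no transition exists can be sketched in a few lines of Python. This is only an illustration: the function name run_fsm, the dictionary encoding of the next-state function, and the three-state machine T are assumptions of the sketch, not part of the original development.

```python
def run_fsm(transitions, start, terminals, symbols):
    """Apply a deterministic FSM to a sequence of input symbols.

    transitions maps (state, symbol) -> next state.  The result is a pair
    (accepted, symbols_read).  The machine stops reading as soon as the
    current state has no transition under the next symbol.
    """
    state = start                          # initialize the machine
    for count, symbol in enumerate(symbols):
        nxt = transitions.get((state, symbol))
        if nxt is None:                    # no transition: stop reading
            return False, count
        state = nxt
    # All symbols were read; accept only if we ended in a terminal state.
    return state in terminals, len(symbols)


# Hypothetical machine recognizing the regular set a b* c.
T = {(0, "a"): 1, (1, "b"): 1, (1, "c"): 2}

print(run_fsm(T, 0, {2}, list("abbc")))
print(run_fsm(T, 0, {2}, list("ab")))
```

The second component of the result records how far the machine read before stopping, which is the information the parsing algorithms of Chapter 3 rely on.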

DPDAs. Our treatment of DPDAs is less formal than that of FSMs. 
For our purposes a DPDA is a machine consisting of an input tape, an 
output tape , a finite control , and a pushdown stack. 

The finite control can be thought of as a program consisting of 
instructions pertaining to the reading of symbols from the input tape and 
the outputting of symbols onto the output tape, the storage, interrogation, 
and removal of items on the stack, and jumps from one point in the program 
to another. The control can be represented by a transition graph whose 
nodes (we use circular nodes for DPDAs) are called states and whose labeled 
arrows are called transitions. 

Each state represents a point in the program which can be jumped to,
and it has a name which is given inside the node. There is a unique starting
state and a unique terminal state, each indicated by a distinctive node symbol.

Each transition implies one of four kinds of instructions, the
interpretations of which are indicated next. If the machine enters state
N having a transition to state M, then, if the label of the transition is
(1) a symbol s, the machine reads the next symbol and, if the symbol
read is s, it then enters state M, (2) "push i", the machine pushes the
item i on the stack and then enters state M, (3) "pop n, out p", the machine
pops the top n items off the stack, outputs p, and then enters state M, or
(4) "top i", the machine compares item i with the top item on the stack,
and, if they are the same, it then enters state M.

The following two conditions are sufficient to guarantee determinism : 
(1) any state having a transition under either "push i" or "pop n, out p" may 
have no other transitions, and (2) any other state must have either every 
transition under a symbol, or every one under "top i" for some item i. 
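
The four kinds of instructions and the two determinism conditions can be made concrete with a small interpreter. The following Python sketch is an assumption-laden illustration: the tuple encoding of labels, the names is_deterministic and run_dpda, and the example machine M (which accepts strings of the form |- a^n b^n -|, n > 0, using a hypothetical bottom marker "$") are all invented here and do not come from the original text.

```python
def is_deterministic(graph):
    """Check the two sufficient conditions for determinism."""
    for arcs in graph.values():
        kinds = {label[0] for label, _ in arcs}
        if kinds & {"push", "pop"} and len(arcs) != 1:
            return False          # (1) push/pop states have no other transition
        if len(kinds) > 1:
            return False          # (2) all reads, or all "top" tests
    return True


def run_dpda(graph, start, terminal, symbols):
    """Return the output list if the final configuration is reached, else None."""
    state, pos, stack, out = start, 0, [], []
    while state != terminal:
        for label, target in graph.get(state, []):
            kind = label[0]
            if kind == "read":    # (1) a symbol s
                if pos < len(symbols) and symbols[pos] == label[1]:
                    pos += 1
                    break
            elif kind == "push":  # (2) "push i"
                stack.append(label[1])
                break
            elif kind == "pop":   # (3) "pop n, out p"
                n, p = label[1], label[2]
                if n > len(stack):
                    return None
                del stack[len(stack) - n:]
                out.append(p)
                break
            elif kind == "top":   # (4) "top i"
                if stack and stack[-1] == label[1]:
                    break
        else:
            return None           # no applicable transition: the machine stops
        state = target
    # Final configuration: all input read, stack empty, terminal state.
    return out if pos == len(symbols) and not stack else None


# Hypothetical machine; state 10 is the terminal state.
M = {
    0: [(("push", "$"), 1)],
    1: [(("read", "|-"), 2)],
    2: [(("read", "a"), 3)],
    3: [(("push", "a"), 4)],
    4: [(("read", "a"), 3), (("read", "b"), 5)],
    5: [(("pop", 1, "b"), 6)],
    6: [(("top", "a"), 7), (("top", "$"), 8)],
    7: [(("read", "b"), 5)],
    8: [(("pop", 1, "end"), 9)],
    9: [(("read", "-|"), 10)],
}

print(is_deterministic(M))
print(run_dpda(M, 0, 10, ["|-", "a", "a", "b", "b", "-|"]))
```

The interpreter consumes an input symbol only when a matching read transition exists; for the purpose of acceptance this is equivalent to reading first and then testing.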

The initial configuration of a DPDA is as follows. It is started in
its starting state with the input string (the string to be parsed, in our case)
on its input tape, with its input head (reading device) over the leftmost
symbol (⊢) of the input string, and with its stack empty. The final configuration
is: the input head one place to the right of the rightmost symbol (⊣) of the
input string, the stack empty, and the machine in its terminal state.

Our special application of DPDAs has prompted us to depart from the usual
restrictions (DAD 69) of allowing "pops" of only one symbol at a time from the
stack, and investigations of items on the stack only when popping them off.
Also, outputs are usually associated with states, as in the case of FSMs. We
believe it is obvious how to modify our DPDAs to abide by these restrictions.
We have deviated from the norm for the sake of simplicity and practicality.

The similarity of DPDAs and FSMs is emphasized if we note that
a DPDA which never uses its stack is equivalent to some FSM. This leads
us to think of a DPDA, then, as being based on some FSM. We think of this
FSM as reading symbols, as usual, but interspersed between some of the
reads are some "bookkeeping" operations involving the stack, and these
operations effect some of the state changes of the FSM. This viewpoint proves
to be quite useful in Chapters 3, 4, and 5.

Chapter 3
PARSERS FOR LR(0) GRAMMARS

3.1 Perspective

Chapters 3, 4, and 5 are difficult ones to read because they contain 
many detailed definitions, lemmas, theorems and corollaries, and intricate 
proofs. But alas, the difficulties cannot be circumvented entirely because 
the material covered is fundamental to the dissertation and must be precise 
and proven, and because it is distinctly nontrivial. We can, however, 
minimize problems by providing perspective via an informal preview of 
the results to come. 

The objective of the present chapter is merely to show how to construct 
parsers for LR(0) grammars, but in the process we lay a foundation upon 
which we ultimately build to cover all LR(k) grammars. 

We begin by showing that the set of characteristic strings of a 
given CF grammar G is a regular language. Thus, the set can be recognized 
by an FSM. We next show that if G is LR(0) the reduced, deterministic FSM 
which does this recognition is adequate, without modification, for use in 
parsing. In particular, the FSM can be used to determine characteristic 
strings of canonical forms, as is necessitated by our parsing algorithm. 
It follows rather directly that the parsing algorithm as a whole can be
converted to a DPDA, the finite control of which can be derived directly
from the FSM.

In Chapter 4 we define the "Simple LR(k)" grammars; i.e., those
grammars for which the special FSMs can be used to determine
characteristic strings if they are extended by the addition of certain "look-ahead"
information which can be computed in a simple way. The conversion of the
modified FSM to a DPDA is straightforward, simply resulting in a DPDA
with "look-ahead".

In Chapter 5 we address the problem of constructing parsers for 
general LR(k) grammars. We find that in some of these cases the modification 
needed for the FSM is the same as above, but that the "look-ahead" information 
is more difficult to compute than for the "Simple LR(k)" grammars. In
the general LR(k) case, however, some of the states of the FSM must be 
split into several copies because of complex correspondences between left 
and right contexts. The state splitting process is explained simply as 
"building into the machine" the capability to remember more left context 
so that the corresponding right contexts can be checked to make parsing 
decisions. Thus, the construction of the parser in the general case can 
become computationally complex. 

In conclusion, what we develop in the next three chapters is a
method for constructing parsers which grows in complexity as it discovers
the complexity of the grammar it is working on. That is, we first assume
the grammar is LR(0) and set out to generate a parser for it. In the
process of constructing the parser we are able to determine if the grammar
is, indeed, LR(0). If it is, we complete our construction and are finished.
However, if the grammar is not LR(0), we assume it is "Simple LR(k)"
and compute the "look-ahead" information in a simple way. If certain
conditions do not hold regarding this "look-ahead", we use more complex
methods and perhaps discover that some state splitting is necessary.
Ultimately, we are able to determine if a given grammar is LR(k) for
any finite value of k given a priori, and if it is, we can construct a
parser for it.†

† Knuth (Knu 65) has shown that it is undecidable, in general, whether a grammar
is LR(k) if k is not given a priori.

3.2 Foundation

To complete the specification of our canonical parser we develop
an automaton which is capable of determining characteristic strings. We
first concentrate on LR(0) grammars and then gradually generalize to
include all LR(k) grammars. The following theorem, regarding both
ambiguous and unambiguous grammars, is fundamental to our development.

Theorem 3.1. The set of characteristic strings of a
given CF grammar G = (V_T, V_N, S, P) is a regular
language.

Proof: Consider a canonical derivation of some
canonical form α:

    ω_0 ω_1 ... ω_m A_m ω''_m ... ω''_1 ω''_0
        →  ω_0 ω_1 ... ω_m ω ω''_m ... ω''_1 ω''_0 = α      by (p) A_m → ω

where m ≥ 0, A_m → ω is the p-th production in P, and
for 0 ≤ i ≤ m each ω''_i is in V_T*, each ω_i and ω'_i is in V*
(recall that V = V_T ∪ V_N), and each A_i → ω_{i+1} A_{i+1} ω'_{i+1} is
a production in P. Then a characteristic string of α is

    ω_0 ω_1 ... ω_m ω #_p.

This string can be generated by a grammar containing
the right linear productions:

    S'    → ω_0 A'_0
    A'_0  → ω_1 A'_1
      ...
    A'_m  → ω #_p

where S' and the A'_i are the nonterminals in this grammar.
Generalizing, we see that the following right linear grammar
generates all possible characteristic strings of G:

    F' = (V_T', V_N', S', P')

where

    V_T' = V ∪ {#_0, #_1, ..., #_s} and
    V_N' = {A' | A is in V_N} and
    S'   = S primed and
    P'   = {A' → ω #_p | A → ω is the p-th production in P}
           ∪ {A' → ω B' | A → ω B ω' is in P and B is in V_N}

Further, because there are no useless productions in
G there corresponds to each derivation of a string
φ#_p using grammar F' derivations using grammar
G of one or more canonical forms, each of which
has φ#_p as a characteristic string. Thus, the grammar
F' generates all and only the characteristic strings of
G.

Finally, F' generates a regular language because
it is a right linear grammar. Q.E.D.

Definition 3.1. The grammar F' of the proof of Theorem
3.1 is called the characteristic grammar of G.

As an example we present the productions of the
characteristic grammar of our example grammar G_0 (see page 29):

(0) S' → ⊢ E ⊣ #_0       (6)  A' → d #_3
(1) S' → ⊢ E'            (7)  E' → b B #_4
(2) E' → a A #_1         (8)  E' → b B'
(3) E' → a A'            (9)  B' → c B #_5
(4) A' → c A #_2         (10) B' → c B'
(5) A' → c A'            (11) B' → d #_6
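
The construction of P' in the proof of Theorem 3.1 is mechanical, and a short Python sketch may help to check the twelve productions listed above. The encoding is an assumption of the sketch: productions are (left part, right part) pairs, the pad symbols are written "|-" and "-|", the symbol #_p is written "#p", and the function name characteristic_grammar is hypothetical.

```python
def characteristic_grammar(productions, nonterminals):
    """Return the right linear productions P' of the characteristic grammar.

    productions is a list of (lhs, rhs) pairs, rhs being a tuple of symbols;
    the list index is the production number p.  Each result is a triple
    (lhs', prefix, tail), where tail is "#p" or a primed nonterminal.
    """
    result = []
    for p, (lhs, rhs) in enumerate(productions):
        # A' -> omega #p, for the p-th production A -> omega
        result.append((lhs + "'", rhs, "#%d" % p))
        # A' -> omega B', for every split A -> omega B omega' with B in V_N
        for i, symbol in enumerate(rhs):
            if symbol in nonterminals:
                result.append((lhs + "'", rhs[:i], symbol + "'"))
    return result


G0 = [
    ("S", ("|-", "E", "-|")),   # (0)
    ("E", ("a", "A")),          # (1)
    ("A", ("c", "A")),          # (2)
    ("A", ("d",)),              # (3)
    ("E", ("b", "B")),          # (4)
    ("B", ("c", "B")),          # (5)
    ("B", ("d",)),              # (6)
]
P_prime = characteristic_grammar(G0, {"S", "E", "A", "B"})
for lhs, prefix, tail in P_prime:
    print(lhs, "->", " ".join(prefix), tail)
```

The productions are printed in a different order from the numbered list above, but the set is the same.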

3.3 CFSMs: Characteristic FSMs

We now concentrate on a particular FSM which can be derived from
a characteristic grammar.

Definition 3.2. A CFSM (characteristic FSM) of a CF
grammar G is a reduced, deterministic FSM which
recognizes the set of characteristic strings of G.

Since any such FSM is unique within the names of its states we refer to the
CFSM of G. The CFSM can be derived from the characteristic grammar
of G via well known techniques (see, for example, (H&U 69) page 33) or it
can be derived directly from G, as we discuss in detail in Section 7.1.

We illustrate in Figure 3.1 the CFSM of our LR(0) grammar G_0.
It is the CFSM which is capable of determining the characteristic strings
of canonical forms for an LR(0) grammar, and an extension of it which is
so capable for an LR(k) grammar. However, the proofs of these statements
require several preliminary results.

In the sequel we use #-transition to mean a transition under a
#-symbol.

Lemma 3.2. Several properties of the CFSM of a CF
grammar G are as follows: (i) it has a single starting
state, (ii) every state is accessible from the starting
state, (iii) every #-transition is to a unique terminal
state T, such that there are none other than #-transitions
to T and such that there are no transitions from T, and
(iv) the terminal state is accessible from every other
state.

Proof: (i) the machine is deterministic, (ii) the machine
is reduced, (iii) every string accepted by the machine has
exactly one #-symbol and it is the last symbol in the
string; thus, any terminal state must have none other than
#-transitions to it, and it must not have any transitions from
it; there is a unique terminal state because the machine
is reduced, and (iv) the machine is reduced. Q.E.D.

Figure 3.1. The characteristic FSM of our example grammar G_0:
(0) S → ⊢ E ⊣, (1) E → a A, (2) A → c A, (3) A → d,
(4) E → b B, (5) B → c B, (6) B → d. Although the terminal-state
symbol appears at several locations in the figure, it is to be
taken as the unique terminal state. [Transition graph not reproduced.]

Now we give convenient and, as we shall see presently, meaningful
names to all the states of such an FSM, except for the terminal state.

Definition 3.3. Any state having no #-transitions is called
a read state.

Definition 3.4. Any state whose only transition is a
#-transition is called a reduce state.

Definition 3.5. Any state having two or more transitions,
at least one of which is a #-transition, is called an
inadequate state. In the case of a state with more
than one #-transition, we sometimes refer to it
as multiply inadequate.

The lattermost definition motivates the following one.

Definition 3.6. A CFSM with no inadequate states is
said to be adequate; otherwise it is said to be inadequate.
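
Definitions 3.3 through 3.6 can be tried out on a machine built directly from a grammar. The Python sketch below uses the familiar item-set construction, which is a technique substituted here for illustration; it is not claimed to be the procedure of Section 7.1. Under that assumption the resulting machine is deterministic and recognizes the characteristic strings, but it is not guaranteed to be reduced. A complete item for production p stands for a #_p-transition to the terminal state. The names build_cfsm and classify are hypothetical, and production 0 is assumed to be the only production for S.

```python
def build_cfsm(productions, nonterminals):
    """Return (states, goto): states are frozensets of items (p, dot)."""
    def closure(items):
        items, work = set(items), list(items)
        while work:
            p, dot = work.pop()
            rhs = productions[p][1]
            if dot < len(rhs) and rhs[dot] in nonterminals:
                for q, (lhs, _) in enumerate(productions):
                    if lhs == rhs[dot] and (q, 0) not in items:
                        items.add((q, 0))
                        work.append((q, 0))
        return frozenset(items)

    start = closure({(0, 0)})
    states, index, goto = [start], {start: 0}, {}
    for n, state in enumerate(states):      # the list grows while we iterate
        moves = {}
        for p, dot in state:
            rhs = productions[p][1]
            if dot < len(rhs):
                moves.setdefault(rhs[dot], set()).add((p, dot + 1))
        for symbol, kernel in moves.items():
            target = closure(kernel)
            if target not in index:
                index[target] = len(states)
                states.append(target)
            goto[(n, symbol)] = index[target]
    return states, goto


def classify(states, goto, productions):
    """Name each state read, reduce, or inadequate (Definitions 3.3 to 3.5)."""
    kinds = {}
    for n, state in enumerate(states):
        reductions = [p for p, dot in state if dot == len(productions[p][1])]
        others = [key for key in goto if key[0] == n]
        if not reductions:
            kinds[n] = "read"
        elif len(reductions) == 1 and not others:
            kinds[n] = "reduce"
        else:
            kinds[n] = "inadequate"
    return kinds


NONTERMINALS = {"S", "E", "A", "B"}
G0 = [("S", ("|-", "E", "-|")), ("E", ("a", "A")), ("A", ("c", "A")),
      ("A", ("d",)), ("E", ("b", "B")), ("B", ("c", "B")), ("B", ("d",))]
states, goto = build_cfsm(G0, NONTERMINALS)
kinds = classify(states, goto, G0)
print(len(states), "states; adequate:", "inadequate" not in kinds.values())
```

For a grammar with productions T → P ^ T and T → P (a hypothetical fragment in the spirit of G_1), the state reached after reading P has both an ordinary transition and a #-transition, so the same code reports it as inadequate.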

3.4 Parsers for LR(0) Grammars

Preliminaries. The following lemma is a concise and useful statement
of the LR(k) condition specialized to the case k = 0. It provides a way to
decide if a grammar G is LR(0) by checking properties of its characteristic
strings, rather than of its canonical forms. This is a decided advantage.
Informally, the lemma means that, if the stack string of one characteristic string
is a prefix of another characteristic string, then G is not LR(0).

Lemma 3.3. Let G be a CF grammar. Let φ_1 #_p and
φ_2 #_q be any two characteristic strings of G such that φ_1 = φ and φ_2 = φθ.
Then G is LR(0) if and only if θ = ε and q = p.

Proof: Our proof depends on the fact that by definition of
characteristic strings there correspond to φ#_p and
φθ#_q canonical forms α_1 = φβ_1 and α_2 = φθβ_2,
respectively, for some β_1 and β_2 in V_T*.
if part: If θ = ε and q = p, then α_1 and α_2 have the
same characteristic string φ#_p. Consider the case
α_1 = α_2 = α. This implies every canonical form α
has a unique characteristic string. Consider the case
α_1 ≠ α_2. If we were given α alleged to be either α_1 or α_2,
we could determine the characteristic string φ#_p of α by
investigating only φ. Since α_1 and α_2 can be any canonical forms
as given above, we have shown that G is LR(0) by definition.
only if part: If G is LR(0) then, if α_1 = α_2 = α, we
must have θ = ε and q = p, since each canonical form
α has a unique characteristic string. If α_1 ≠ α_2 and
if θ ≠ ε and/or q ≠ p, then α_1 and α_2 have distinct
characteristic strings, and given α alleged to be
either α_1 or α_2, we could not determine the
characteristic string of α on the basis of φ alone. Since this
is a contradiction of the LR(k) definition for k = 0,
we again have θ = ε and q = p. Q.E.D.

We use this lemma immediately to verify another and still more
useful method for deciding if a grammar is LR(0).

Theorem 3.4. A CF grammar G is LR(0) if and only
if its CFSM is adequate.

Proof: if part: If the CFSM is adequate and if it
accepts the string φ#_p, then it cannot accept the string
φθ#_q for θ ≠ ε and/or q ≠ p. For if it did, the state
accessed by φ would be inadequate, having distinct
transitions under #_p and 1:θ#_q. G is therefore LR(0)
by the "if part" of Lemma 3.3.
only if part: By Lemma 3.2, part (ii), each state N
of G's CFSM is accessible by some string φ. Assume
that N has a #_p-transition; i.e., that the CFSM accepts
φ#_p. If N had another distinct transition, it would be
either to the terminal state or to a state from which the
terminal state is accessible, by Lemma 3.2, part (iv).
Thus, the machine would also accept φθ#_q for some
θ ≠ ε and/or q ≠ p. But by the "only if part" of Lemma
3.3, φ#_p and φθ#_q cannot both be characteristic strings,
i.e., the CFSM cannot accept both, unless θ = ε and
q = p. Thus, any such N must have only the #_p-transition,
and the CFSM is adequate by definition. Q.E.D.

Thus, we have proved that our example grammar G_0 is LR(0) by
exhibiting its CFSM (Figure 3.1), which is adequate by inspection.

Parsers. We now prove that for the special case of an LR(0)
grammar, the corresponding CFSM is capable of determining the
characteristic strings of canonical forms.

Theorem 3.5. Let G be an LR(0) grammar and α = φβ
be a canonical form of G with characteristic string φ#_p.
The stack string φ accesses a reduce state of G's
CFSM whose only transition is under #_p.

Proof: The CFSM accepts the string φ#_p. Thus,
φ accesses a state N with a transition under #_p.
But since G is LR(0), Theorem 3.4 implies N is
not an inadequate state. Therefore, N must be a
reduce state whose only transition is under #_p. Q.E.D.

Parsing algorithm. Thus for an LR(0) grammar G our parsing
algorithm can be restated as follows. Commencing with α = η, where
η is a string in L(G), and with the CFSM of G:

(i) Initialize the CFSM and apply it to the current canonical form α.
When the machine enters a reduce state R, it will have read the stack
string φ of α and will have left to read the input string β of α.

(ii) The only transition from R must be under #_p for some production
p, so output p.

(iii) Apply reduction p to the end of φ and concatenate the result and
β to form the next canonical form α.

(iv) If the new form is α = S then stop; otherwise start at step (i) again.
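
Steps (i) through (iv) can be sketched directly as string manipulation. The Python below is a sketch under stated assumptions: the transition table GOTO and the table REDUCE were derived by hand for G_0 and use state numbers of our own choosing, which need not match those of Figure 3.1; the pad symbols are written "|-" and "-|"; and the function name parse is hypothetical.

```python
# Hand-derived machine for G_0 (illustrative state numbers, start state 0).
GOTO = {
    (0, "|-"): 1, (1, "E"): 2, (1, "a"): 3, (1, "b"): 4, (2, "-|"): 5,
    (3, "A"): 6, (3, "c"): 7, (3, "d"): 8,
    (7, "A"): 9, (7, "c"): 7, (7, "d"): 8,
    (4, "B"): 10, (4, "c"): 11, (4, "d"): 12,
    (11, "B"): 13, (11, "c"): 11, (11, "d"): 12,
}
# Reduce state -> production number p of its single #p-transition.
REDUCE = {5: 0, 6: 1, 9: 2, 8: 3, 10: 4, 13: 5, 12: 6}
# Production p -> (left part, length of right part).
G0 = [("S", 3), ("E", 2), ("A", 2), ("A", 1), ("E", 2), ("B", 2), ("B", 1)]


def parse(sentence):
    """Return the production numbers output, or None if the CFSM stops reading."""
    form, output = list(sentence), []
    while form != ["S"]:                        # step (iv)
        state, read = 0, 0                      # step (i): initialize, apply
        while state not in REDUCE:
            if read == len(form) or (state, form[read]) not in GOTO:
                return None                     # no transition: abort
            state = GOTO[(state, form[read])]
            read += 1
        p = REDUCE[state]                       # step (ii): output p
        output.append(p)
        lhs, length = G0[p]                     # step (iii): reduce end of phi
        form = form[:read - length] + [lhs] + form[read:]
    return output


print(parse(["|-", "a", "c", "d", "-|"]))
```

Each pass of the outer loop rescans the current form from its left end, which is exactly the inefficiency that Section 3.5 removes.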

Note that in this algorithm characteristic strings are determined
without checking the entire string. Thus, in general, when it is applied
to a string η' not in L(G), it goes through several iterations, making
reductions on the left part of η', but it ultimately aborts when the CFSM
is applied to a string α' = φ'β' such that φ' accesses a state with no
transition under 1:β'; i.e., when the CFSM stops reading. This must
be the case because there is no other way for the algorithm to fail, and
because if it were successful, that would imply there exists a canonical
parse of η'. (Recall the discussion at the end of Section 2.3.)

Obviously this parser is neither efficient nor strictly left-to-right 
since it starts back at the beginning of the stack string at each iteration. 
We now solve these two problems by converting our string-manipulation 
algorithm to a DPDA. 

3.5 Conversion of the Parsers to DPDAs

Our conversion technique is most easily understood if it is presented
in two steps. We first convert our parser to a "stack algorithm"; i.e.,
an algorithm incorporating a pushdown stack. The use of the stack eliminates
the need for rescanning the stack string at each iteration. Then we give a
technique for converting the CFSM to the finite control of a DPDA, such that
the DPDA simulates the stack algorithm.

Consider an iteration of our parsing algorithm. We begin with some
canonical form α' = ρωβ whose characteristic string is ρω#_p. We apply
the CFSM to α'. The string ρω is read, the CFSM enters a reduce state,
and the characteristic string is determined. If production p is A → ω, we
replace ω with A to form α = ρAβ and start anew.

Now, on the next iteration the first action of the CFSM is to read
ρ again. But the CFSM is deterministic and will therefore go through the
same sequence of states while reading ρ this time as it did on the previous
step. Thus, had we remembered in the previous step the state N of the
CFSM immediately after reading ρ, we could in this step merely start the
CFSM in N and apply it to Aβ to get the desired result.

The stack algorithm: To eliminate the rescanning of the stack string
at each iteration we use a pushdown stack. As the CFSM reads a canonical
form we push onto the stack the names of the states entered by the CFSM.
Upon determining the characteristic string, say ρω#_p where production p
is A → ω, we pop the top |ω| state-names off the stack and output p. We
then return the CFSM to the state whose name is at the top of the stack
(determining the top name is called looking back) and continue the process
by reading Aβ. The process ends when the string to be read is simply S.

It should be clear, in light of the two paragraphs preceding the
algorithm, that the stack algorithm is equivalent in effect to our previous
algorithm. However, it is more efficient than the previous one.
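
A minimal Python sketch of the stack algorithm follows, again for G_0. It rests on the same assumptions as before: the tables GOTO and REDUCE are hand-derived with illustrative state numbers, and the function name parse_with_stack is hypothetical. The sketch also assumes that the name of the starting state is on the stack from the outset, so that looking back always finds a state name.

```python
GOTO = {
    (0, "|-"): 1, (1, "E"): 2, (1, "a"): 3, (1, "b"): 4, (2, "-|"): 5,
    (3, "A"): 6, (3, "c"): 7, (3, "d"): 8,
    (7, "A"): 9, (7, "c"): 7, (7, "d"): 8,
    (4, "B"): 10, (4, "c"): 11, (4, "d"): 12,
    (11, "B"): 13, (11, "c"): 11, (11, "d"): 12,
}
REDUCE = {5: 0, 6: 1, 9: 2, 8: 3, 10: 4, 13: 5, 12: 6}
G0 = [("S", 3), ("E", 2), ("A", 2), ("A", 1), ("E", 2), ("B", 2), ("B", 1)]


def parse_with_stack(sentence):
    """Return the production numbers output, or None on failure."""
    stack, output = [0], []          # names of the states entered so far
    rest = list(sentence)            # the string still to be read
    while True:
        state = stack[-1]            # looking back: the top state-name
        if state in REDUCE:
            p = REDUCE[state]
            lhs, length = G0[p]
            del stack[len(stack) - length:]   # pop |omega| state-names
            output.append(p)
            if lhs == "S":
                # The string to be read is simply S only if no input remains.
                return output if not rest else None
            rest.insert(0, lhs)      # continue by reading A beta
        else:
            if not rest or (state, rest[0]) not in GOTO:
                return None          # the CFSM stops reading
            stack.append(GOTO[(state, rest.pop(0))])


print(parse_with_stack(["|-", "a", "c", "d", "-|"]))
```

No symbol to the left of the reduction point is ever read twice; the state-names on the stack stand in for the left context.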

We emphasize , for reasons which will become apparent shortly, that 
the sequence of state-names stored in the stack at a particular time T 
represents a path through the CFSM. The path is the one which would be 
taken by the CFSM were it to be applied to the prefix which is implicitly 
the left context at time T. This property is the basis of several observations 
which we make below. 

Note that at this stage we have substantially departed from the
string-manipulation notions with which we began. Our stack algorithm has no
further interactions with symbols after it has read them. Instead, it interacts
with the state-names of the CFSM. We now move another step away from our
original parsing notions by converting the stack algorithm plus CFSM to a
DPDA.

The conversion technique. We consider the CFSM to be the basis from 
which we construct the finite control of our DPDA. Since both FSMs and 
finite controls can be represented by transition graphs, the technique can 
be described as a piecewise conversion of one graph into another. 

We think of the CFSM-graph as a skeletal program which we must
convert to a detailed program (finite control) by filling in more instructions.
The basic structure and the read instructions are already in the program,
and we must add the stack-manipulation instructions. Our guide to this
programming task is, of course, the stack algorithm.

For each state N of the CFSM there is a state named N in the DPDA, 
such that the actions of the DPDA immediately subsequent to entering state 
N are similar to the actions of the stack algorithm when the CFSM is in 
state N. The CFSM can be converted to the appropriate finite control by 
applying to it the three transformations indicated in Figure 3. 2. 

Figure 3.2a indicates a transformation for replacing #-transitions 
with "reduction procedures". Consider a reduce state R corresponding 
to production p, A → ω. We replace the #_p-transition from R with a 
transition under "pop |ω|, out p" to a new look-back state R′. There is 
one transition from R′ under "top N" to state M for each pair (N, M) in 
the set Q, where Q = {(N, M) | there exists an A-transition from N to M 
and a path from N to R which spells out ω}. 

Note that there is an optimization implicit in this transformation. The 
reduction procedure executed by the stack algorithm can be described via the 
following sequence: "pop |ω|, out p"; look back and see N; return the CFSM 
to state N; read A (which causes the CFSM to enter state M). However, the 
reduction procedure for the DPDA is simply: "pop |ω|, out p"; look back and 
see N; enter state M. That is, the DPDA does not manipulate the nonterminal 
A. The optimization might be described as precomputing part of the reduction 
procedure and "wiring the results into the machine". 

Figure 3.2. Transformations for converting the CFSM of 
an LR(0) grammar G to a DPDA-parser for G. 

There is one exception to our first transformation. If p = 0 then we 
replace the #₀-transition from R with one under "pop 4, out 0" to the 
terminal state. This follows because the associated production is known 
to be of the form S → ⊢ S′ ⊣; i.e., because we know that R is associated 
with the final reduction. If we analyze first the parsing algorithm at the 
end of Section 3.4 and then the stack algorithm, we see that when our DPDA 
enters state R, the implicit left context must be ⊢ S′ ⊣, and therefore, 
that there must be four state-names in the stack. Thus, "pop 4" empties 
the stack so that the final configuration of the machine will be correct. 

Figure 3. 2b indicates a transformation which causes the DPDA to 
push the same state-names on its stack as the stack algorithm does on its, 
and at the same time. That is, when the DPDA enters state N, it first 
pushes the name N on its stack and then it enters a new state N' where it 
continues doing whatever the stack algorithm would do with the CFSM in 
state N. 

Figure 3. 2c indicates the deletion of all transitions under nonterminals. 
This is possible because of the optimization implicit in Figure 3. 2a and 
because the DPDA is assumed to be parsing only terminal strings.† 

† However, we believe that, if the transitions under nonterminals were 
retained, the DPDA could parse any sentential form. 

In Figure 3.3 we present the result of applying the first and third of 
our transformations to the graph of Figure 3.1. We did not apply the second 
transformation for two reasons: (1) the figure would have gotten too large 
and unreadable, and (2) in our implementation in Chapter 7 we find it 
efficient to implement "states which push their names on the stack when 
they are entered"; i.e., we implement a state N, its transition under 
"push N", and the following state N′ as a single state. Thus, a square 
node N can be thought of as an abbreviation for such a state. 

Figure 3.3. The finite control of the DPDA-parser for our 
example grammar G₀: (0) S → ⊢ E ⊣, (1) E → a A, (2) A → c A, 
(3) A → d, (4) E → b B, (5) B → c B, (6) B → d. This figure 
was derived from Figure 3.1. 

To illustrate the operation of our machines, we indicate in Table 3-1 
the history which results when the DPDA implied by Figure 3.3 is applied 
to the string ⊢acd⊣ in L(G₀). Note that for perspicuity we indicate at 
each step the symbols of what is implicitly the left context. Of course, 
those symbols are not stored in the stack by the DPDA. 
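
The kind of history recorded in Table 3-1 can also be reproduced in code. 
The following Python sketch is an illustration only, not the implementation 
of Chapter 7. It assumes that the CFSM of G₀ can be rebuilt by the familiar 
LR(0) item-set construction, so its state numbers are arbitrary and need not 
agree with those of Figure 3.3. The names PRODS, closure, build_cfsm, and 
parse are hypothetical, and the strings "|-" and "-|" stand for the end 
markers ⊢ and ⊣. 

```python
# Grammar G0 of Figure 3.3; "|-" and "-|" stand for the end markers.
PRODS = [
    ("S", ["|-", "E", "-|"]),  # (0)
    ("E", ["a", "A"]),         # (1)
    ("A", ["c", "A"]),         # (2)
    ("A", ["d"]),              # (3)
    ("E", ["b", "B"]),         # (4)
    ("B", ["c", "B"]),         # (5)
    ("B", ["d"]),              # (6)
]
NONTERMINALS = {lhs for lhs, _ in PRODS}


def closure(items):
    """Close a set of (production, dot) items under nonterminal expansion."""
    items = set(items)
    work = list(items)
    while work:
        p, dot = work.pop()
        rhs = PRODS[p][1]
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            for q, (lhs, _) in enumerate(PRODS):
                if lhs == rhs[dot] and (q, 0) not in items:
                    items.add((q, 0))
                    work.append((q, 0))
    return frozenset(items)


def build_cfsm():
    """Return (states, goto); goto maps (state, symbol) to the next state."""
    states, goto, i = [closure({(0, 0)})], {}, 0
    while i < len(states):
        moves = {}
        for p, dot in states[i]:
            rhs = PRODS[p][1]
            if dot < len(rhs):
                moves.setdefault(rhs[dot], set()).add((p, dot + 1))
        for sym in sorted(moves):
            target = closure(moves[sym])
            if target not in states:
                states.append(target)
            goto[(i, sym)] = states.index(target)
        i += 1
    return states, goto


def parse(tokens):
    """Simulate the DPDA on a token list; return the productions output."""
    states, goto = build_cfsm()
    stack, out, pos = [0], [], 0       # the stack holds state names only
    while True:
        state = stack[-1]
        done = [p for p, dot in states[state] if dot == len(PRODS[p][1])]
        if done:                       # reduce state (one item: G0 is LR(0))
            p = done[0]
            lhs, rhs = PRODS[p]
            out.append(p)
            if p == 0:                 # final reduction: "pop 4, out 0"
                stack.clear()
                return out
            del stack[-len(rhs):]      # "pop |w|, out p"
            stack.append(goto[(stack[-1], lhs)])   # "top N": enter state M
        else:                          # read state
            sym = tokens[pos] if pos < len(tokens) else None
            if (state, sym) not in goto:
                raise SyntaxError(f"abort in state {state} at position {pos}")
            stack.append(goto[(state, sym)])
            pos += 1
```

Under these assumptions, parse(["|-", "a", "c", "d", "-|"]) returns 
[3, 2, 1, 0], the canonical parse of ⊢acd⊣. Note that the reduction step 
never reads the nonterminal: it consults only the state name left on top 
of the stack, as in the transformation of Figure 3.2a. 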

Comments: A read state of a DPDA is one all of whose transitions 
are under symbols. When a DPDA for an LR(0) grammar G is applied to 
a string η′ not in L(G), it must abort in a way similar to the way the stack 
algorithm aborts. This follows because the DPDA simulates the stack 
algorithm. In particular, the machine will ultimately enter a read state N 
having no transition under the next symbol to be read. Further, the 
corresponding state N of the CFSM is the one in which the CFSM would abort 
if the stack algorithm were applied to η′. 

The only other seemingly possible time that the DPDA could 
abort is when it is in a look-back state. But this possibility is 
ruled out, again because the DPDA simulates the stack algorithm. 
The stack algorithm looks back only to decide in which state to 
restart the CFSM after a reduction. It does not look back to check the 
validity of the information in the stack since, as noted above, that information 
always represents a path through the CFSM. Thus, look-backs cannot fail. 

Table 3-1. The history of grammar G₀'s DPDA-parser applied to the 
string ⊢acd⊣ in L(G₀). 

We do not formally prove that our DPDA for a given LR(0) grammar 
G is a correct parser for the sentences of G. Instead, we informally argue 
that the DPDA is equivalent in effect to the stack algorithm, which in turn 
is equivalent to the algorithm at the end of Section 3. 4, which in turn 
is equivalent to our canonical parser that was informally proved to be 
correct in Section 2. 3. We implicitly rely on a similar line of reasoning 
with respect to our parsers throughout the remainder of the dissertation. 

3. 6 Optimizing the DPDAs 

As noted above our DPDAs have already been optimized with respect 
to the stack algorithm. By precomputing part of the reduction procedures, 
we increase both the time- and space-efficiency of our machines. Less 
time is used because the reductions are executed with fewer machine 
operations, and less space is used because transitions under nonterminals 
are unnecessary. There are three more ways in which the DPDAs can be 
optimized and all three are related to look-back in one respect or another. 

(1) Two look-back states R′₁ and R′₂ are said to be equivalent if and 
only if for each transition from R′₁ (R′₂) under "top N" to state M there is 
a similar transition from R′₂ (R′₁). Clearly, equivalent look-back states 
may be eliminated in favor of a single state, in the obvious way. Note that 
the machine of Figure 3.3 has already been optimized in this way; e.g., 
7 and 8 have transitions to the same look-back state. Clearly, the effect 
of this optimization is only to increase space-efficiency. 

(2) Another optimization arises from the fact that we look-back only 
to determine which state to enter after a reduction. Thus, if all the 
transitions from a given look-back state R' are to the same state M, 
then R' is unnecessary. States 15 and 17 of Figure 3. 3 can be eliminated 
due to this property, increasing both the time- and space-efficiency of the 
DPDA. That is, the transitions from states 5 and 10 may by-pass states 
15 and 17, respectively, and go directly to state 2. 

(3) Finally, note that reduce states need not push their names on the 
stack since the names are immediately popped off again without ever being 
interrogated (via a "top R"). Thus, the square node R in the lower part of 
Figure 3.2a can be changed to a round node R, and "pop |ω|" must then be 
changed to "pop |ω| - 1". 

In fact, in almost all cases only those states in the set X = {N | 
there is a transition under "top N" in the machine} need push their names 
on the stack; i.e., be represented by square nodes. Of course the 
"pop |ω|" instructions must be changed accordingly, and thence arise 
the only exceptions to the previous statement. If we follow the path from 
N₁ to R in Figure 3.2a, starting with a counter set to zero as we leave N₁, 
and increment the counter by one each time we encounter a state in the set X, 
we can reduce "pop |ω|" to "pop n", where n is the value of the counter 
after reaching R. However, the same statement applies to the path from 
N₂ to R, ..., and Nₙ to R. Clearly, each path must imply the same n, 
or if this is not the case, some extra states not in X must push their names 
so that the paths are "balanced", in the obvious sense. 

In the case of our DPDA of Figure 3. 3, only states 1, 4, 6, 9, and 
11 (the ones in the corresponding set X) need push their names. The effect 
of this optimization is, of course, to increase both time- and space-efficiency, 
but it also reduces the depth of the stack during execution. 
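
The three optimizations can be sketched as operations on look-back tables. 
The Python fragment below is a minimal illustration under stated assumptions: 
each look-back state is represented as a dictionary from the stack-top name N 
to the next state M, one table per production, and the state numbers are 
inferred from the remarks made in this section about Figure 3.3 rather than 
read from the figure itself. The names LOOK_BACK, merge_equivalent, 
bypassable, and pushing_states are hypothetical. 

```python
# One look-back table per production of G0: {stack-top N: next state M}.
# The state numbers are an assumption inferred from the text, not from
# Figure 3.3 itself.
LOOK_BACK = {
    1: {1: 2},            # E -> a A
    4: {1: 2},            # E -> b B
    2: {4: 5, 6: 7},      # A -> c A
    3: {4: 5, 6: 7},      # A -> d
    5: {9: 10, 11: 12},   # B -> c B
    6: {9: 10, 11: 12},   # B -> d
}


def merge_equivalent(tables):
    """Optimization (1): group productions whose tables are identical."""
    groups = {}
    for prod, table in tables.items():
        groups.setdefault(frozenset(table.items()), []).append(prod)
    return groups


def bypassable(table):
    """Optimization (2): the single target M if all transitions agree,
    otherwise None."""
    targets = set(table.values())
    return next(iter(targets)) if len(targets) == 1 else None


def pushing_states(tables):
    """Optimization (3): the set X of names tested by some "top N"."""
    return {n for table in tables.values() for n in table}
```

With these tables, pushing_states(LOOK_BACK) yields {1, 4, 6, 9, 11}, which 
agrees with the set X quoted above, and bypassable reports state 2 for the 
tables of productions 1 and 4, the case in which the look-back state may be 
by-passed. 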

Comments. To indicate the significance of these optimizations in a 
practical case, we give some statistics relating to our DPDA which is 
presented in Chapter 7. The DPDA corresponds to the grammar of a pro- 
gramming language which is quite practical, syntactically. The optimized 
machine has 172 states. The first optimization reduced the potential number 
of look-back states from 82 to 32. The second optimization further reduced 
the number to 22. The third optimization reduced the number of states 
pushing their names on the stack from 157 to 61 (again only those states 
in the corresponding set X); i. e. , it reduced the depth of the stack during 
execution to about 3/8 of what it would otherwise have been. 

We delay any specific estimates of the time-efficiencies of our 
machines until we have discussed parsers for "Simple LR(k)" grammars, 
the subject of Chapter 4. The LR(0) grammars are not very interesting 
for our purposes, so the efficiencies of their parsers are also uninteresting. 
However, we find the "Simple LR(1)" grammars, and therefore their 
parsers, quite interesting, as we shall see. 

We delay discussion of specific space-efficiencies until Chapter 7, 
where we are concerned with implementation issues. Space- efficiency 
is most easily discussed in terms of an actual implementation. 

Regarding implementation issues, the fact that look-back is not for 
validation of information on the stack, also implies two possible optimizations 
when implementing these parsers. (1) If the implementation is sequential 
in nature (as is the one presented below), then, if in all but a few cases 
the transitions from a look-back state R' go to a single state M, the "odd 
balls" may be checked first and, if the top of the stack is not one of them, 
a default transition to M may be made. (2) If the implementation is para- 
llel in nature (e. g. , array or matrix look-ups), then "compatible" look- 
back states may profitably be merged into a single state. For instance, 
in Figure 3.3 the four look-back states are "compatible" and can be merged 
to form a single state having transitions under "top 1" to state 2, "top 4" 
to 5, "top 6" to 7, "top 9" to 10, and "top 11" to 12. (The fact that the 
first number in each case is one less than the second is a "red herring".) 
We do not pursue the parallel possibilities in the present dissertation, even 
though they have significant potential. 

Finally, we emphasize that, since all of these optimizations concern 
look-back, they have no effect on error detection. That is, the optimized 
DPDA will detect that its input string is not in L(G), if indeed that is the 
case, at the same time relative to the reading of η′ as would the unoptimized 
DPDA. 
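
The two implementation-level optimizations described above (default 
transitions for a sequential implementation, merged tables for a parallel 
one) can be sketched as follows. This is an illustrative Python fragment, 
not the implementation of Chapter 7; the dictionary representation of a 
look-back state and the names with_default, look_back, and merge_compatible 
are assumptions made for the example. 

```python
from collections import Counter


def with_default(table):
    """Split a look-back table {N: M} into the "odd balls" and a default."""
    default, _ = Counter(table.values()).most_common(1)[0]
    exceptions = {n: m for n, m in table.items() if m != default}
    return exceptions, default


def look_back(exceptions, default, top):
    """Sequential look-back: test the odd cases first, else take the default."""
    return exceptions.get(top, default)


def merge_compatible(tables):
    """Merge look-back tables that never disagree on a shared stack-top name."""
    merged = {}
    for table in tables:
        for n, m in table.items():
            if merged.setdefault(n, m) != m:
                raise ValueError(f"tables disagree on top {n}")
    return merged
```

The default transition is safe only because look-back never validates the 
stack: a name that reaches the default branch is, by the path property of 
the stack, one of the names the original table would have accepted. 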

3. 7 Conclusion 



At this point it is advisable that the reader should reread Section 3. 1 
to place the foregoing results into perspective. 

We have now developed much "machinery" for converting CFSMs to 
optimized DPDA parsers. Of course, our results thus far are useful only 
for LR(0) grammars, but we shall see in Chapters 4 and 5 that with the 
addition of one more transformation rule, namely one relating to 
"look-ahead", we shall have the "machinery" necessary for covering all LR(k) 
grammars. The problem of generating parsers for "Simple LR(k)" 
grammars, then, reduces to that of appropriately adding "look-ahead" 
information to CFSMs, and that for general LR(k) grammars reduces to 
appropriately splitting some states of the CFSMs and then adding 
"look-ahead" information. 

Chapter 4 
PARSERS FOR SIMPLE LR(k) GRAMMARS 

We now investigate a class of grammars which is of substantial 
interest from the viewpoint of programming-language design and 
specification. The class is a subset of the LR(k) grammars for which parsers 
are only slightly more difficult to construct than are parsers for LR(0) 
grammars. The class includes the LR(0) grammars, and the accompanying 
parser-constructing technique is based on our LR(0) technique. 

We begin by discussing the nature of the "inadequacy" of CFSMs for 
non-LR(0) grammars and a solution for that "inadequacy". 

4. 1 Inadequacy, Look-ahead 

In the case of a grammar G which is not LR(0), Lemma 3.3 implies 
that G has at least one pair of characteristic strings of the form φ#_p and 
φθ#_q such that p ≠ q and/or θ ≠ ε. By definition of characteristic strings, 
then, there exist canonical forms α₁ = φβ₁ and α₂ = φθβ₂ which have the 
characteristic strings φ#_p and φθ#_q, respectively. 

Assume that we attempt to use G's CFSM to determine the 
characteristic string of a form α alleged to be either α₁ or α₂. If we apply the 
CFSM to α, it reads φ and enters a state having distinct transitions under 
#_p and 1:θ#_q (recall the proof of Theorem 3.4); i.e., the machine enters 
an inadequate state. What do we do then? 

If α = α₁ then we should stop and apply reduction p to the end of φ. 
However, if α = α₂ then, if θ ≠ ε, we should allow the CFSM to continue 
reading, whereas if θ = ε, we should stop and apply reduction q to the end 
of φ. The problem is that there is not a unique parsing decision associated 
with an inadequate state, as is the case with a read or reduce state. 

Stated another way, the state, and therefore the CFSM, are indeed 
"inadequate" for use in determining characteristic strings. However, the 
LR(k) definition itself hints at a solution to this inadequacy. By using the 
CFSM we have, in effect, investigated and remembered some pertinent 
features of the left context φ. However, we have not investigated the right 
context at all; i.e., we have not looked ahead of the decision point. 

Let us consider an example. There follow the productions† of 
the characteristic grammar of our example grammar G₁ (page 19). 

(0) S′ → ⊢ E ⊣ #₀          (7) T′ → P ↑ T #₃ 
(1) S′ → ⊢ E′              (8) T′ → P ↑ T′ 
(2) E′ → E + T #₁          (9) T′ → P′ 
(3) E′ → E + T′           (10) T′ → P #₄ 
(4) E′ → E′               (11) P′ → i #₅ 
(5) E′ → T #₂             (12) P′ → ( E ) #₆ 
(6) E′ → T′               (13) P′ → ( E′ 

† Note that the production E′ → E′ makes the grammar "infinitely ambiguous"; 
i.e., each sentence has infinitely many canonical parses. This is of no 
concern to us here because we are not interested in the "structural properties" 
of the grammar. We are only interested in the strings which the grammar 
generates and the CFSM which accepts them. 


The corresponding CFSM is illustrated in Figure 4. 1. For our purposes 
here the only state of interest is the inadequate one, state 7. 

Consider the two canonical forms of G₁: α₁ = ⊢P + i⊣ and α₂ = ⊢P↑i⊣. 
The unique characteristic strings of α₁ and α₂ are ⊢P #₄ and ⊢P↑i #₅, 
respectively, as the reader may easily confirm by canonically deriving the 
forms. Clearly, the prefix ⊢P, which is common to α₁ and α₂, accesses 
state 7 of G₁'s CFSM. 

Now, if we were given α alleged to be either α₁ or α₂, we could 
determine α's characteristic string as follows. First, we apply G₁'s CFSM 
to α. Then, when the CFSM enters state 7, we look ahead at, but do not let 
the CFSM try to read, the next symbol to be read. If the symbol is +, then 
the characteristic string is the prefix read by the CFSM thus far (⊢P) 
concatenated with #₄. However, if the symbol is ↑, we must allow the 
CFSM to continue reading to determine the characteristic string. (In this 
case the machine would read ↑i and enter state 10, thus determining that 
the characteristic string is ⊢P↑i #₅.) 

In fact, we show below that no matter what canonical form α of G₁ 
we are given, if a prefix φ of α accesses state 7, then we can determine 
via one symbol look-ahead whether α's characteristic string is φ#₄ or 
φ↑... In particular, if we look one symbol ahead and see a symbol in 
the set {⊣, +, )}, the characteristic string is φ#₄, but if we see one in 
the set {↑}, it is φ↑... 

LALR(k) Grammars. The above discussion and example might lead 
one to think that perhaps every LR(k) grammar has the property that its 
sentences can be parsed, in a manner similar to that just illustrated, by 
using its CFSM and some look-ahead sets associated with the transitions 
from inadequate states. Unfortunately, this is not the case. However, for 
purposes of discussion let us informally define a CF grammar to be LALR(k) 
(for look-ahead LR(k)) if and only if it has the above-stated property. 

Clearly every LALR(k) grammar is LR(k), since the determination 
of characteristic strings for such a grammar is based on some knowledge 
of left context and at most k symbols of right context. In fact, the deter- 
mination concerns only the equivalence class of the left context. Further, 
a minimum number of equivalence classes is involved, since we use an FSM 
with a minimum number of states to remember relevant information about 
left context. 

We illustrate in Chapter 5 that the LALR(k) grammars are a subset 
of the LR(k) grammars by giving a grammar for which adding look-ahead 
alone is not a sufficient modification to the CFSM; it must have some of 
its states split to make it remember more about left context; i.e., to 
increase the number of equivalence classes of left context. 

Unfortunately, again as we shall see in Chapter 5, even the LALR(k) 
grammars cannot be described as a "simple" subset, since the computation 
of the look-ahead sets for some of those grammars is distinctly nontrivial. 
Thus, if we are to have a parser-constructing technique which grows in 
complexity as it discovers the complexity of the grammar at hand, we 
should not jump from a procedure covering the LR(0) grammars to one 
covering the LALR(k) grammars. 

Instead, we consider next a smaller subset of the LR(k) grammars 
which are distinguished both by the fact that adding look-ahead to the 
corresponding CFSMs is sufficient to render them useful for determining 
characteristic strings and that the computation of look-ahead sets is 
simple. It turns out, as we shall see in Section 4. 8, that even this 
smaller subset is a large and useful set of grammars. 

4. 2 Simple LR(k) Grammars 

Expediency dictates that we define this subset of the LR(k) grammars 
in terms of our parser-constructing technique, as we did in the case of 
the LALR(k) grammars. This is not unreasonable since there seems to be 
no good, intuitive definition in terms of canonical forms and parsing decisions, 
anyway. 

The "simple" function which is central to our definition and which is 
useful for computing look-ahead sets is as follows. 

Definition 4.1. Let k be a positive integer and let 
G = (V_N, V_T, S, P) be a CF grammar, one of whose 
nonterminals is A. Then 

    F_T^k(A) = {(k:β) ∈ V_T* | S →* ρAβ for some ρ, β}.† 

Thus, F_T^k(A) is the set of all terminal strings of length k which may follow A 
in a canonical form of G. We are interested in look-ahead sets containing 
only terminal strings because our ultimate DPDAs will operate in a strictly 
left-to-right manner and will be applied to nothing but sentences. 

† Our set notation is an abbreviation of the usual mathematical notation: 
{σ ∈ V_T* | S →* ρAβ for some ρ, β and σ = k:β}. 

As an example we compute F_T^1(P) for grammar G₁: P appears in the 
right parts of two productions. The production T → P ↑ T implies that ↑ is 
in F_T^1(P). The production T → P implies that all the strings in F_T^1(T) are 
also in F_T^1(P). E → E + T and E → T each imply that the members of F_T^1(E) 
are also in F_T^1(T). S → ⊢ E ⊣ implies that ⊣ is in F_T^1(E); E → E + T adds +; 
and P → (E) adds ")". Thus, we have determined that F_T^1(P) = {↑, ⊣, +, )}, 
and in the process that F_T^1(T) = F_T^1(E) = {⊣, +, )}. 
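
The computation just carried out by hand can be mechanized. The sketch 
below is a minimal Python illustration for k = 1; it uses plain fixed-point 
iteration rather than the bit-matrix technique, and it assumes a grammar 
with no empty right parts (true of G₁). The names G1 and follow_sets are 
hypothetical, and "|-", "-|", and "^" stand for ⊢, ⊣, and ↑. 

```python
# Grammar G1; "|-", "-|" and "^" stand for the end markers and the up-arrow.
G1 = [
    ("S", ["|-", "E", "-|"]),
    ("E", ["E", "+", "T"]),
    ("E", ["T"]),
    ("T", ["P", "^", "T"]),
    ("T", ["P"]),
    ("P", ["i"]),
    ("P", ["(", "E", ")"]),
]


def follow_sets(prods):
    """Compute F_T^1(A) for every nonterminal A; assumes no empty right parts."""
    nts = {lhs for lhs, _ in prods}
    first = {a: set() for a in nts}
    follow = {a: set() for a in nts}

    def first_of(sym):
        # Terminals that can begin a string derived from sym.
        return first[sym] if sym in nts else {sym}

    changed = True
    while changed:                      # first symbols, by fixed-point iteration
        changed = False
        for lhs, rhs in prods:
            new = first_of(rhs[0]) - first[lhs]
            if new:
                first[lhs] |= new
                changed = True
    changed = True
    while changed:                      # then the follower sets
        changed = False
        for lhs, rhs in prods:
            for i, sym in enumerate(rhs):
                if sym not in nts:
                    continue
                # Either the next symbol's first terminals, or whatever may
                # follow lhs when sym ends the right part.
                src = first_of(rhs[i + 1]) if i + 1 < len(rhs) else follow[lhs]
                new = src - follow[sym]
                if new:
                    follow[sym] |= new
                    changed = True
    return follow
```

Applied to G1, follow_sets gives {"^", "-|", "+", ")"} for P and 
{"-|", "+", ")"} for both T and E, in agreement with the sets derived above. 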

Warshall (War 62) has a fast "bit-matrix technique" which can be 
used (Che 67) for computing these sets for k = 1. This is particularly 
important since we expect the large majority of the grammars of interest 
to be "Simple LR(1)" as we indicate in Section 4.8. Further, for those 
few grammars which are not "Simple LR(1)" we expect to have to resort 
to k = 2 or 3, say, with respect to only one or two inadequate states. Thus, 
we have a reasonable step up in complexity from the LR(0) grammars. 

We now define the look-ahead sets in terms of which we later define 
the "Simple LR(k)" grammars. 

Definition 4.2 (Recursive on the value of k.) Let G be a CF 
grammar and k be a positive integer. There is associated 
with each terminal- and #-transition of G's CFSM a simple 
k-look-ahead set which is as follows. For a #_p-transition, 
where production p is A → ω, the set is F_T^k(A). For a 
transition under the terminal t the set is {t} if k = 1 and 
otherwise {tβ′ ∈ V_T* | the t-transition is to a state N and 
β′ is in the simple (k-1)-look-ahead set associated with 
some transition from N}. 

Comments: (1) We do not define look-ahead sets for transitions 
under nonterminals because our ultimate DPDAs will have no such transitions, 
and (2) although for ease of definition sets are associated with every 
terminal- and #-transition of the CFSM, we are interested only in the 
sets for transitions from inadequate states. 

As an example, we illustrate the computation of the 
simple 3-look-ahead set for the ↑-transition in Figure 4.1. The 
computation is actually unnecessary for grammar G₁, since G₁ is 
"Simple LR(1)". 

First, we follow all paths leading from state 7, never taking 
transitions under nonterminals, until either a string of length three is 
spelled out or until the terminal state is reached. The strings spelled out 
by all such paths are ↑i#₅, ↑(i, and ↑((. Next, the desired set of strings 
can be derived from these strings as follows. First, each string which 
contains no #-symbol is in the desired set. Second, for each string of the 
form σ#_p, where production p is A → ω and |σ| = n, every string which can 
be formed by concatenating σ with a member of F_T^(k-n)(A) is in the desired 
set. In our special case the latter means ↑i concatenated with the members 
of F_T^1(P). Thus, the simple 3-look-ahead set for the ↑-transition is 
{↑(i, ↑((, ↑i↑, ↑i⊣, ↑i+, ↑i)}. 
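
The second half of this computation, completing the path strings with 
follower strings, is easy to express in code. The Python sketch below is an 
illustration only: it takes the path strings as given (they are copied from 
the example, not computed from Figure 4.1), represents strings as tuples of 
symbols, and uses the hypothetical names simple_lookahead, PATHS, and F1. 

```python
def simple_lookahead(paths, followers):
    """Complete path strings into a simple k-look-ahead set.

    paths: iterable of (terminals, lhs); lhs is the left part A of the
    production whose #-transition ended the path, or None if the path
    reached length k without meeting a #-symbol.
    followers: maps A to the set of terminal tuples, of the missing
    length k - n, that may follow A.
    """
    result = set()
    for terminals, lhs in paths:
        if lhs is None:
            result.add(terminals)          # already of full length
        else:
            result.update(terminals + tail for tail in followers[lhs])
    return result


# The three path strings of the example; "^" and "-|" stand for the
# up-arrow and the right end marker.
PATHS = [(("^", "i"), "P"), (("^", "(", "i"), None), (("^", "(", "("), None)]
F1 = {"P": {("^",), ("-|",), ("+",), (")",)}}
```

Here simple_lookahead(PATHS, F1) produces the six strings listed above. 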

Finally we come to our main definition. 

Definition 4.3. Let k be a positive integer. A CF grammar 
G is Simple LR(k), abbreviated SLR(k), if and only if for each 
inadequate state N (if any) of G's CFSM the simple k-look-ahead 
sets associated with the (terminal- and #-) transitions 
from N are mutually disjoint. G is SLR(0) if and only if it 
is LR(0). 

Our example grammar G₁ is SLR(1). Proof: The simple 1-look-ahead 
set associated with the ↑-transition of its CFSM is {↑} and that 
of the #₄-transition is F_T^1(T) = {⊣, +, )}, as we have seen. Obviously, 
these sets are disjoint. 

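
The disjointness test of Definition 4.3 amounts to a pairwise comparison 
of sets. A minimal Python sketch follows; the function name, the dictionary 
representation of a state, and the labels "^" and "#4" are assumptions made 
for the example. 

```python
from itertools import combinations


def lookahead_sets_disjoint(lookahead_sets):
    """True if the look-ahead sets of one inadequate state are mutually
    disjoint; the argument maps each transition label to its set."""
    return all(a.isdisjoint(b)
               for a, b in combinations(lookahead_sets.values(), 2))


# Inadequate state 7 of G1 with its simple 1-look-ahead sets.
STATE_7 = {"^": {"^"}, "#4": {"-|", "+", ")"}}
```

A grammar passes the SLR(k) test when this check succeeds for every 
inadequate state of its CFSM; for G₁ there is only state 7 to examine. 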
4. 3 SLRkFSMs 

We now turn to the question of how to explicitly encode look-ahead 
sets into CFSMs. We desire an explicit encoding for two reasons: (1) 
it facilitates proofs that "CFSMs-plus-look-ahead sets" can be used to 
determine characteristic strings, and (2) it facilitates our discussion of a 
technique for converting those machines to DPDAs. 

The encoding is accomplished by adding to each CFSM transitions 
under "generalized symbols". If R is a look-ahead set associated with 
a given X-transition (X not a nonterminal) of the CFSM, then X^R is a 
generalized symbol associated with the X-transition and the set R. 

Definition 4.4. Let G be an SLR(k) grammar. We construct 
G's SLRkFSM from its CFSM as follows. For each inadequate 
state N (if any) of the CFSM and for each X-transition (X not a 
nonterminal) from N having associated with it the simple 
k-look-ahead set R, we add a transition from N, under the 
generalized symbol X^R, to the terminal state. 



Clearly an SLRkFSM is a reduced, deterministic FSM. It accepts 
the characteristic strings of G plus the strings in the set {φX^R | φ accesses 
an inadequate state N of G's CFSM and N has an X-transition (X not a 
nonterminal) with which is associated the simple k-look-ahead set R}. 

As in the case of CFSMs we use the terms "read", "reduce", and 
"inadequate" with regard to states of SLRkFSMs, in the obvious way. 
However, for emphasis we sometimes refer to the inadequate states as 
"modified-inadequate states". 

In the case of grammar G₁, its SLR1FSM is the graph in Figure 4.1 
with state 7 replaced by the following: 

[Diagram not reproduced.] 

In the sequel we sometimes use an abbreviated notation for 
modified-inadequate states. For instance, the above can be abbreviated: 

[Diagram not reproduced: the abbreviated form of modified-inadequate state 7.] 

We emphasize that this is only an abbreviation. Our theorems below are 
easier to prove in terms of the former notation than the latter. 

The modified stack-algorithm. In a manner similar to the way in 
which we developed DPDA-parsers for LR(0) grammars, we first state a 
stack-algorithm which uses an SLRkFSM to determine characteristic strings, 
and then we convert the SLRkFSM to a DPDA which simulates the 
stack-algorithm. Our stack-algorithm here is simply our previous one modified 
to "look ahead" at the appropriate times. We present the algorithm next 
and prove that it works correctly afterward. 

Commencing with a string α = η, where η is in L(G), with an empty 
stack, and with G's SLRkFSM in its starting state: 

(i) Apply the SLRkFSM to α; store on the stack the names of the states 
entered by the machine as it reads. 

(ii) If, after reading some prefix φ of α such that α = φβ, the machine 
enters a reduce or inadequate state N, then 

    (a) if N is a reduce state whose only transition is under #_p, 
    where production p is A → ω, then output p, pop the top |ω| 
    names off the stack, return the SLRkFSM to the state whose 
    name is at the top of the stack, set α = Aβ, and go to step (iii). 

    (b) if N is an inadequate state with (among others) transitions 
    under the generalized symbols X₁^R₁, X₂^R₂, ..., Xₙ^Rₙ, compare 
    the strings in the sets R₁, R₂, ..., Rₙ with k:β. Exactly one 
    match will occur, say with a string in R_i. 

        (1) If X_i is a #-symbol, execute step (ii), part (a), 
        as if N were a reduce state whose only transition is 
        under X_i. 

        (2) However, if X_i is a terminal symbol, treat N as if 
        it were a read state (i.e., as if it had only its 
        transitions under symbols), continue the reading 
        and name-storing processes, and return to step (ii). 

(iii) If α = S then stop; otherwise, return to step (i). 
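
The modified stack-algorithm can be sketched for the case k = 1 as follows. 
This Python fragment is an illustration under several assumptions, not the 
construction developed in this chapter: the CFSM of G₁ is rebuilt by the 
familiar LR(0) item-set construction (so state numbers are arbitrary and do 
not match Figure 4.1), the look-ahead set of a #-transition is computed by 
simple fixed-point iteration, and the nonterminal is never actually re-read 
after a reduction. All function names are hypothetical, and "|-", "-|", and 
"^" stand for ⊢, ⊣, and ↑. 

```python
# Grammar G1; "|-", "-|" and "^" stand for the end markers and the up-arrow.
PRODS = [("S", ["|-", "E", "-|"]), ("E", ["E", "+", "T"]), ("E", ["T"]),
         ("T", ["P", "^", "T"]), ("T", ["P"]), ("P", ["i"]),
         ("P", ["(", "E", ")"])]          # productions (0) through (6)
NTS = {lhs for lhs, _ in PRODS}


def closure(items):
    """Close a set of (production, dot) items under nonterminal expansion."""
    items = set(items)
    work = list(items)
    while work:
        p, dot = work.pop()
        rhs = PRODS[p][1]
        if dot < len(rhs) and rhs[dot] in NTS:
            for q, (lhs, _) in enumerate(PRODS):
                if lhs == rhs[dot] and (q, 0) not in items:
                    items.add((q, 0))
                    work.append((q, 0))
    return frozenset(items)


def build_cfsm():
    """Return (states, goto); goto maps (state, symbol) to the next state."""
    states, goto, i = [closure({(0, 0)})], {}, 0
    while i < len(states):
        moves = {}
        for p, dot in states[i]:
            rhs = PRODS[p][1]
            if dot < len(rhs):
                moves.setdefault(rhs[dot], set()).add((p, dot + 1))
        for sym in sorted(moves):
            target = closure(moves[sym])
            if target not in states:
                states.append(target)
            goto[(i, sym)] = states.index(target)
        i += 1
    return states, goto


def followers():
    """Return F_T^1(A) for every nonterminal A (no empty right parts)."""
    first = {a: set() for a in NTS}
    follow = {a: set() for a in NTS}

    def start(sym):
        return first[sym] if sym in NTS else {sym}

    changed = True
    while changed:
        changed = False
        for lhs, rhs in PRODS:
            updates = [(first[lhs], start(rhs[0]))]
            for i, sym in enumerate(rhs):
                if sym in NTS:
                    tail = start(rhs[i + 1]) if i + 1 < len(rhs) else follow[lhs]
                    updates.append((follow[sym], tail))
            for target, source in updates:
                if not source <= target:
                    target |= source
                    changed = True
    return follow


def parse(tokens):
    """Modified stack-algorithm with k = 1; return the productions output."""
    states, goto = build_cfsm()
    follow = followers()
    stack, out, pos = [0], [], 0
    while True:
        state = stack[-1]
        nxt = tokens[pos] if pos < len(tokens) else None
        done = [p for p, dot in states[state] if dot == len(PRODS[p][1])]
        can_read = any(s == state and x not in NTS for s, x in goto)
        if done and (can_read or len(done) > 1):   # inadequate: look ahead
            done = [p for p in done if nxt in follow[PRODS[p][0]]]
        if done:                                   # step (ii)(a) or (ii)(b)(1)
            p = done[0]
            lhs, rhs = PRODS[p]
            out.append(p)
            if p == 0:
                return out
            del stack[-len(rhs):]
            stack.append(goto[(stack[-1], lhs)])
        elif (state, nxt) in goto and nxt not in NTS:   # read, or (ii)(b)(2)
            stack.append(goto[(state, nxt)])
            pos += 1
        else:
            raise SyntaxError(f"abort in state {state} at position {pos}")
```

Under these assumptions, parse(["|-", "i", "^", "i", "+", "i", "-|"]) 
returns [5, 5, 4, 3, 2, 5, 4, 1, 0]. The look-ahead is consulted only in 
the one inadequate state (the counterpart of state 7); pure reduce states 
reduce without inspecting the next symbol, as in step (ii)(a). 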

Proof. Since the present stack-algorithm is like our previous one 
except for the addition of a procedure related to inadequate states, we need 
only prove that it operates correctly when the SLRkFSM enters such a state. 
Informally, we prove in Theorem 4.1 that, if the SLRkFSM reads to the end 
of a canonical form's stack string, the algorithm will correctly determine the 
characteristic string. Then, in Theorem 4.2 we prove that in reading the 
stack string the algorithm will not make an incorrect choice before reaching 
the end. 

Theorem 4.2. Let G be an SLR(k) grammar and α = φθβ be a 
canonical form of G with characteristic string φθ#_p such that 
θβ is in V_T* but θ ≠ ε. Then, if φ accesses an inadequate state 
N of G's SLRkFSM having transitions under the generalized 
symbols X₁^R₁, X₂^R₂, ..., Xₙ^Rₙ for some n ≥ 2, the string 
k:θβ is in R_i but not in R_j for 1 ≤ i ≠ j ≤ n, such that X_i = 1:θ. 

Proof: k:θβ may appear in at most one of the sets R₁, R₂, ..., Rₙ 
since the sets are mutually disjoint. k:θβ must appear in R_i 
such that X_i = 1:θ for the following reasons. Since both G's 
SLRkFSM and its CFSM accept φθ#_p, there is a path leading 
from N (of both) which spells out θ#_p. It is easy to see from 
the definition of a simple k-look-ahead set R for a terminal 
transition (in particular, one under 1:θ) that if |θ| ≥ k then 
k:θ is in R, whereas if |θ| = n < k then every string formed 
by concatenating θ with a member of F_T^(k-n)(A) is in R, where 
production p is A → ω. The latter includes k:θβ by definition 
of F_T^(k-n)(A). Q.E.D. 

4. 4 Minimizing Look-ahead 

We noted in Chapter 2 (page 31) that the smallest value of k for which 
a grammar is LR(k) is limited by the worst case of necessary look-ahead. 
A similar statement is true regarding the SLR(k) condition. In fact, we 
could have defined SLR(k) grammars in the following alternate way. We 
could have first defined a grammar G to be "SLR(k) with respect to" a 
given inadequate state of its CFSM, in the obvious way. Then we could 
have defined G to be SLR(k) if and only if it is "SLR(k) with respect to" 
each of its CFSM's inadequate states. 

This alternate definition emphasizes the fact that the look-ahead 
sets for the transitions from a given state N may be computed for the 
smallest value of k such that the sets are mutually disjoint. In effect, 
we recognized this fact to a limited extent in Theorem 4. 1; i. e. , we 
recognized that every grammar is "SLR(0) with respect to" each reduce 
state of its CFSM. Only notational and expositional difficulties prevented 
us from incorporating this fact into our definition of SLRkFSMs and 
Theorems 4. 1 and 4. 2, rather than belatedly bringing it up now. 

Fine tuning. In some cases not only may the amount of look-ahead 
required be different for distinct states, but even a single state may have 
strings of various lengths in its look-ahead sets. Consider, for instance, 
a state N having only the two look-ahead sets {ab, cd} and {ae}. Clearly, 
if the SLRkFSM is in N and the next symbol to be read is c, we need not 
investigate the second symbol to make the associated parsing decision. 
That is, the first set above may be changed to {ab, c}. 

In general, look-ahead sets may have the lengths of their strings 
minimized as follows. Consider a state N with look-ahead sets R₁, R₂, ..., Rₙ 
for some n ≥ 2. We change each set R_i by removing from the right end 
of each string in R_i the maximum number of symbols such that the result 
is not a prefix of a string in R_j, for 1 ≤ i ≠ j ≤ n. Clearly, the sets remain 
mutually disjoint after these changes. 

Note that this optimization is not applicable to simple 1-look-ahead 
sets, since ε is a prefix of every string. 
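
The shortening rule can be sketched in a few lines of Python. This is an 
illustration only; it assumes that each symbol is a single character, so 
that look-ahead strings can be ordinary Python strings, and the name shorten 
is hypothetical. 

```python
def shorten(sets):
    """Shorten each look-ahead string to its shortest nonempty prefix that
    is not a prefix of any string in another set of the same state."""
    result = []
    for i, current in enumerate(sets):
        others = [t for j, other in enumerate(sets) if j != i for t in other]
        shortened = set()
        for s in current:
            for n in range(1, len(s) + 1):
                if not any(t.startswith(s[:n]) for t in others):
                    shortened.add(s[:n])
                    break
            else:
                shortened.add(s)      # no shorter form is safe; keep s whole
        result.append(shortened)
    return result
```

For the example above, shorten([{"ab", "cd"}, {"ae"}]) gives 
[{"ab", "c"}, {"ae"}]: only cd can be cut, because a alone would not 
distinguish ab from ae. 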

4.5 The Conversion of SLRkFSMs to DPDAs

It should be clear from the modified stack-algorithm that the
transformations implied by Figure 3.2 remain valid ones, as regards the
read and reduce states of our SLRkFSMs. Furthermore, the computation
of look-back states implied by Figure 3.2a is also valid for the #-transitions
from inadequate states. Thus, all we need now is one more transformation
rule; i.e., one for mapping modified-inadequate states, whose associated
look-back states have already been computed, into look-ahead states† of
a DPDA. The appropriate transformation is implied by Figure 4.2, and
the conversion technique goes as follows.

First, we apply the transformation implied by Figure 3.2a to each
reduce state of the SLRkFSM. Also, for each inadequate state I of the



† Again we are abusing strict automata theory by allowing our DPDAs to
"look ahead". We do so for the sake of simplicity and practicality. It is
well known (Knu 65) that DPDAs without "look ahead" can perform the same
computations as ours.
Figure 4.2. The transformation for converting modified-inadequate
states to look-ahead states. This transformation plus those implied
by Figure 3.2 are all that are needed for converting an appropriate
FSM to a DPDA-parser for any LR(k) grammar.
machine and for each #-transition T from I, we apply the former transformation
to I, as if it were a reduce state whose only transition is T. The result
after this first step is, of course, a machine with "inadequate states" of
the form indicated in the left part of Figure 4.2, where if n = 1 then
m ≥ 1, but if n > 1 then m ≥ 0.

In the case of the inadequate state 7 of $G_1$'s SLRkFSM (illustrated
in Section 4.3), the result is as follows:
Next, we apply to each inadequate state I resulting from the first step
the transformation implicit in Figure 4.2. The latter indicates a conversion
to a look-ahead state I of the DPDA. The intent, of course, is that when
the DPDA is in state I it should simulate the modified stack-algorithm when
the SLRkFSM is in state I (recall step (ii) of the algorithm).

The result of applying this second step to state 7 illustrated above is 
as follows. 
Finally, we apply the transformations implied by Figures 3.2b and
3.2c to the machine, and we have the desired "DPDA with look-ahead",
except for optimizations.

Optimizations. Since the optimizations discussed in Section 3.6
applied only to look-back, which is independent of look-ahead, each of
those optimizations is also applicable to "DPDAs with look-ahead". Only
one more optimization presents itself, and it is applicable only to (the very
important case of) 1-symbol look-ahead states. We illustrate this final
optimization in conjunction with the presentation in Figure 4.3 of the fully
optimized DPDA-parser for grammar $G_1$.

For present purposes consider only state 7. The intent is that, when
the DPDA enters state 7, it should look ahead as usual and, if the next symbol
is ⊣, +, or ), it should enter state 16 next, as usual; however, if the symbol is
↑, it should move its read head to the right one place and then enter state 8.
That is, the state is a sort of combination "look-ahead read-state", and it
eliminates the inefficiency of investigating the ↑ twice. We allow such states
because it is easy to implement them, as we show in Chapter 7.
Figure 4.3. The fully optimized DPDA-parser for grammar $G_1$.
This figure was derived from Figure 4.1. The dashed arrows are 
not intended as part of the machine. Recall that when the DPDA 
enters a state represented by a square, it pushes the name of 
that state on its stack. 
The dashed arrows in Figure 4.3 indicate the transitions under
nonterminals which were removed from $G_1$'s SLR1FSM in forming the
DPDA. That is, they are not to be taken as part of the DPDA. We include
them to facilitate future discussions and to aid the thoroughly interested
reader in reviewing the transformation rules as they apply to this example.

Recall that when the DPDA enters a state represented by a square, 
it pushes the name of the state on the stack. 

4.6 Time-Efficiency

From the automata-theoretic viewpoint a parser is simply a translator;
it is a machine which translates strings into parses; i.e., strings of symbols
into strings of production numbers. We adopt this viewpoint for the purpose
of discussing the time-efficiency of our parsers.

We informally define time-efficiency in terms of an "ideal machine".
The latter is assumed to be able to translate a string of n symbols into a
string of m symbols with only 2(n+m) "machine operations" of approximately
equal complexity (execution time); i.e., it takes n reads, m outputs, and
n + m accompanying state-changes. By the "time-efficiency" of a DPDA,
then, we mean the number of machine operations required by the ideal machine
to perform a given translation divided by the number required by the DPDA
to perform the same translation.

In Table 4-1 we illustrate the history which results when the DPDA
of Figure 4.3 is applied to the string
$\eta_1 = {\vdash}\, i \uparrow i + i \,{\dashv}$ in $L(G_1)$. Counting
Table 4-1. The history which results when the DPDA of Figure 4.3 is
applied to the string $\eta_1 = {\vdash}\, i \uparrow i + i \,{\dashv}$.
all the pushes, pops, reads, outputs, and state-changes executed by the
machine, it requires 52 machine operations to map $\eta_1$ into its canonical
parse. Since $|\eta_1| = 7$ and since there are nine symbols (production
numbers) in the parse, the ideal machine could have performed the
translation in 2(7 + 9) = 32 machine operations. Thus, the time-efficiency of
the DPDA is 32/52 or about 62% for $\eta_1$.

If a similar table is constructed for the unoptimized version of $G_1$'s
DPDA-parser, we find that it takes 79 machine operations; i.e., for $\eta_1$
its time-efficiency is 32/79 or about 41%. Thus, the optimized DPDA is
1.5 times as fast as the unoptimized one.

A general case. Let us consider the time-efficiency for a more
general case. In particular, let us compute the worst-case time-efficiency
for the DPDA-parser of some SLR(1) grammar, when it is applied to a
string of n symbols having a canonical parse of m symbols. We merely
analyze the behavior of the DPDA (assumed to be similar to the one of
Figure 4.3) and determine the maximum number of machine operations
which can be associated with each of the n+m symbols.

At worst we may need a push, a read, and a state-change for each
of the n input symbols, since we may need to push the name of each read
and look-ahead state. For each of the m output symbols (i.e., for each
reduction), we may need a push, a look-ahead and a state-change, then a
pop, an output and a state-change, and finally a look-back and a
state-change.
Thus, the DPDA could take as many as 3n + 7m machine operations
to perform the translation. The time-efficiency in the worst case, therefore,
is 2(n+m)/(3n + 7m), which approaches its minimum of 2/7, or about 29%,
when m ≫ n.
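
The arithmetic of this section can be checked with a small sketch. The
function names are hypothetical; the formulas are exactly those given
above, namely 2(n+m) operations for the ideal machine and at most
3n + 7m for the DPDA.

```python
def time_efficiency(n, m, operations):
    """Ideal operation count 2(n+m) divided by the count actually used."""
    return 2 * (n + m) / operations


def worst_case_efficiency(n, m):
    """Efficiency when the DPDA needs the full 3n + 7m operations."""
    return 2 * (n + m) / (3 * n + 7 * m)


if __name__ == "__main__":
    # The example string: 7 input symbols, 9 production numbers.
    print(round(time_efficiency(7, 9, 52), 2))   # optimized DPDA
    print(round(time_efficiency(7, 9, 79), 2))   # unoptimized DPDA
    print(round(79 / 52, 1))                     # speed ratio of the two
    # The worst case tends toward 2/7 as m grows relative to n.
    print(worst_case_efficiency(1, 10**9))
```

The first three values reproduce the 62%, 41%, and 1.5 quoted above.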

4.7 Error Detection

In the present section we have three points to make regarding the
actions of a DPDA-parser for an SLR(k) grammar G when the DPDA is
applied to a string $\eta'$ not in L(G):

(1) The machine must ultimately detect the "error".

(2) It may detect the error either while reading or while looking ahead.

(3) It may not detect the error as soon as it would have had its look-ahead
sets been computed by using functions complex enough to cover the LALR(k)
or general LR(k) grammars.

(1) The first point follows from the facts that the DPDA ultimately
simulates our canonical parser of Chapter 2 and that there exists no
canonical parse for $\eta'$. (Recall the argument at the end of Section 3.6.)

(2) Our DPDAs without look-ahead had only one way in which to 
abort, namely by entering a read state with no transition under the next 
symbol to be read. Clearly, by adding look-ahead states we add another 
possibility. The machine may enter a look-ahead state N such that none 
of the strings in the look-ahead sets of the transitions from N match the 
beginning of the string remaining to be read. 
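
The behavior of a look-ahead state, including this second way of
aborting, might be sketched as follows. The sketch is illustrative only:
the transition names T1 and T2 and the function choose_transition are
hypothetical, symbols are assumed to be single characters, and the sets
are those of the fine-tuning example of Section 4.4.

```python
def choose_transition(lookahead_sets, remaining):
    """Select the transition whose look-ahead set matches the input.

    lookahead_sets -- dict mapping a transition name to its set of strings
    remaining      -- the part of the input string not yet read
    Returns the name of the matching transition, or None if no string of
    any set matches the beginning of the remaining input (an error).
    """
    matches = [name for name, strings in lookahead_sets.items()
               if any(remaining.startswith(s) for s in strings)]
    if not matches:
        return None
    if len(matches) > 1:
        raise ValueError("look-ahead sets are not mutually disjoint")
    return matches[0]


if __name__ == "__main__":
    state_n = {"T1": {"ab", "c"}, "T2": {"ae"}}
    print(choose_transition(state_n, "cd"))   # decided on one symbol
    print(choose_transition(state_n, "aeb"))  # needs two symbols
    print(choose_transition(state_n, "ad"))   # no match: error detected
```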
(3) We illustrate our third point by example. Consider an SLRkFSM
with two inadequate states $N_1$ and $N_2$, each having a $\#_p$-transition,
where production p is A → ω. For i = 1, 2 let
$RC_i = \{\beta \in V_T^* \mid \varphi\beta$ is a canonical form with
characteristic string $\varphi\#_p$ such that $\varphi$ accesses state
$N_i\}$. Assume that $RC_1$ and $RC_2$ are disjoint sets. Then if
$\varphi_1$ accesses $N_1$ and $\beta_2$ is in $RC_2$, $\varphi_1\beta_2$
is not a canonical form. And yet, if our DPDA is in state $N_1$
with implicit left context $\varphi_1$ and right context $\beta_2$, it will
not detect the error immediately via look-ahead. This follows because the
simple k-look-ahead set corresponding to the $\#_p$-transition contains
$k{:}\beta_2$, by definition.

Clearly, if the look-ahead sets of the $\#_p$-transitions from $N_1$ and
$N_2$ are reduced to $R_i = \{k{:}\beta \mid \beta \in RC_i\}$ for i = 1, 2,
respectively, then the DPDA continues to correctly parse sentences in L(G).
However, after this change, it will detect the above error via look-ahead
when it is in state $N_1$ since $k{:}\beta_2$ is not in $R_1$.

What we have discovered is that, if the look-ahead sets for a state N
are computed independently of the left contexts which access N, as is the
case when we use $F^k$, the sets sometimes contain strings which cannot
begin a legitimate right context when the machine is in state N. Thus, in
a sense, $F^k$ is not always "restrictive" enough. Note, however, that this
situation may obtain only if there is more than one transition in the machine
under some #-symbol. (In our practical example in Chapter 7 only 2 of
82 productions have more than one corresponding #-transition in the CFSM.)

Our example also illuminates the difference between LALR(k) grammars
and SLR(k) grammars. If a grammar G is LALR(k) but not SLR(k) for a
particular value of k, then some state of G's CFSM must have overlapping
simple k-look-ahead sets. And yet, if those sets are reduced by considering
corresponding left contexts, they become mutually disjoint. In Chapter 5
our first example illustrates such a grammar, and we find that in general
the functions necessary for computing look-ahead sets for LALR(k) grammars
are the same complex functions which are necessary for general LR(k)
grammars.

4.8 On the Extent of the SLR(k) Grammars

We should like to give the reader some intuitive feel for the usefulness
and the extent of the SLR(k) grammars; that is, a feel for which grammars
are SLR(k) and which are not. But alas, given our conceptual framework
there seems to be no good intuitive explanation, so we resort to discussing
some inclusion relations between SLR(k) and other well-known grammars.

In the Appendix we show that the "weak precedence" grammars of
Ichbiah and Morse (I&M 69) are included in the SLR(1) grammars. Since
those authors have shown that the "simple precedence" grammars of
Wirth and Weber (W&W 66) are a subset of the "weak precedence" ones,
it follows that the "simple precedence" grammars are SLR(1). Further,
it is easy to see from the proofs in the Appendix that, if the "precedence
relations" were extended to include k symbols of right context, the
resulting "right-extended weak-precedence" grammars would also be
SLR(k). This leads us to suspect that the "(1, k) precedence" grammars
of Wirth and Weber (W&W 66), the "(0, k) bounded context"† grammars
of (F&G 68), and the "ICOR (0, k)" grammars of Lynch (Lyn 68) are all
SLR(k).

But these inclusions really undersell the SLR(k) grammars, for
the latter include many grammars which are in none of the above classes
or their generalizations. They include all LR(0) grammars and many
other LR(k) grammars for which arbitrary left context is necessary to
make parsing decisions. Our example grammar $G_1$ is a case in point,
as we noted in Section 2.4.

The ability of the CFSM for a given grammar to remember some 
left context which may be arbitrarily far to the left seems to arise because 
the confusion between contexts, which may obtain when two productions may 
be applicable to the same part of a string, is minimized in the CFSM in 
the following sense. If there exists an inadequate state N in the CFSM, 
then no matter how much left context we investigate we will not be able 
to make the parsing decision associated with N. The former statement 



† These grammars should really have been called "(0, k) bounded right
context" (Flo 64).
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is implied by Lemma 3. 3: if <p accesses N, then there exist characteristic 
strings <p# and cpd# and corresponding canonical forms. 
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Chapter 5 
PARSERS FOR GENERAL LR(k) GRAMMARS 

5. 1 Objective 

In the present chapter we continue the development of our parser- 
constructing technique. However, before we proceed we (1) place the 
foregoing results into perspective by reviewing them from the viewpoint 
of a TWS attempting to construct a parser for a given grammar, (2) preview 
the results of the present chapter, and (3) disclaim any interest in these 
results from the practical viewpoint. 

Review. Assume that we are given a CF grammar G and that we are
to construct a parser for it. We first assume that G is LR(0) and construct
its CFSM. If the CFSM is adequate, G is LR(0) so we convert its CFSM to
a DPDA and are finished. If, however, the CFSM is inadequate, we
determine if G is SLR(1) by computing the simple 1-look-ahead sets for the
transitions from the inadequate states. If the sets for each inadequate state
are mutually disjoint, G is SLR(1) so we convert the CFSM to an SLR1FSM
and then convert the latter to a DPDA with one-symbol look-ahead. As noted
above, we expect none of the grammars of interest to be LR(0), but most of
them to be SLR(1).

Of course, it may be that there are one or more inadequate states
which have overlapping, simple 1-look-ahead sets, in which case our work
is not done. For the transitions from each such state we compute the simple
k-look-ahead sets for some values of k > 1. Since the time-efficiency of
our ultimate parser will go down as k goes up (because look-ahead means
multiple interrogation of some symbols), we shall undoubtedly be interested
in only a restricted range of values of k, probably k < 3 or so. If it turns
out that the simple k-look-ahead sets are mutually disjoint, i.e., that G
is SLR(k) for an acceptable value of k, then we can construct a
DPDA-parser which has perhaps some one-symbol, and one or more k-symbol,
look-ahead states.

In some cases, of course, we shall find that G is not SLR(k) for
an acceptable k. However, there remains the possibility that G is LR(k)
for such a k. For instance, our first example below is a grammar which is
not SLR(k) for any k, but which is LR(1). In such a case we need more
complex methods, first for determining if a grammar is LR(k) for a given
k and second for constructing a corresponding parser if the former is the
case.

Preview. These more complex methods are the subjects of the
present chapter. In some cases (those of the LALR(k) grammars) we
find that our modification of the CFSM is the same as for an SLR(k)
grammar, but that the look-ahead sets are more difficult to compute than
for the latter. In other cases, however, we find that some states must
be split into several copies so the CFSM will remember more left context
and so we can check corresponding right contexts to determine characteristic
strings. The determination of the appropriate state-splitting and
corresponding look-ahead requires techniques which are substantially
more complex, computationally, than our previous methods.

We introduce these notions by defining a set of grammars via
some "sets of bounded-context pairs" and by showing how to extend our
techniques to cover those of the latter grammars which are not SLR(k).
The reasons for state-splitting come out rather naturally in the discussion,
which leads eventually to a method for covering all LR(k) grammars.

Impracticality. The reader should keep in mind throughout this
chapter that we expect to have to resort to these techniques only rarely,
if at all. This expectation stems primarily from two sources. First,
the grammars which were shown in Section 4.8 to be included in the SLR(k)
grammars have been found to be quite useful for describing much of the
syntax of many programming languages (F&G 62). The prime example
is, of course, EULER (W&W 66). Second, our own experience with languages,
particularly with the language whose grammar and translator are presented
in Chapter 7, has been especially encouraging in this respect. The latter
grammar generates an extremely powerful, useful, and readable language
with many constructs in common with languages like FORTRAN, ALGOL,
EULER, PL/I, etc. The grammar was designed to be unambiguous, small,
concise, and useful as a syntactical reference for programmers (i.e., for
the determination of operator precedences, associativities, etc.), but
it was not designed with our parser-constructing techniques in mind.
Indeed, the techniques did not exist when the grammar was designed.
And yet, the grammar turns out to be SLR(1).

Thus, the material in this chapter is here more because of a desire 
for completeness and for a fuller understanding of the LR(k) grammars than 
for its expected usefulness in practice. Consequently, we do not in this 
chapter concern ourselves particularly with the efficiencies of the techniques 
discussed. We are primarily interested in getting across the ideas. 

5.2 "Bounded-Context" Examples

In this section we analyze two grammars which are not SLR(k). The
first is an LALR(k) grammar for which the look-ahead sets can be determined
by using a function which computes "bounded-context pairs". The second
grammar is not LALR(k), but it is LR(k); i.e., its CFSM needs both
state-splitting and look-ahead. The above-mentioned function is found to be
useful in the second case, also.

The two examples motivate the definition of a set of grammars which
we call "L(m)R(k)", and a parser-constructing technique to cover them.
These grammars include, and their definition has similarities with the
definition of, the "bounded right context" grammars (Flo 64); i.e., those
grammars whose sentences can be parsed during a deterministic,
left-to-right scan with each parsing decision being made on the basis of the
knowledge of a bounded amount of context surrounding the decision point.

Example 1. Consider the CFSM shown in Figure 5.1. It corresponds to
grammar $G_2$, which contains the following productions:

    (0) S → ⊢ E ⊣        (3) E → b A c
    (1) E → a A d        (4) E → b e d
    (2) E → a e c        (5) A → e

There are two inadequate states in the CFSM, states 7 and 12, both involving
production 5, whose left part is A. Since $G_2$ generates only four strings,
namely ⊢aed⊣, ⊢aec⊣, ⊢bec⊣, and ⊢bed⊣, it is trivial to compute the
appropriate simple k-look-ahead sets. In particular, for any k > 1,
$F^k(A) = \{c\dashv,\ d\dashv\}$ is the set for the $\#_5$-transitions; that
for the c-transition from state 7 is $\{c\dashv\}$; i.e., c followed by the
only member of $F^{k-1}(E) = \{\dashv\}$; and that for the d-transition from
state 12 is $\{d\dashv\}$; i.e., d followed by the only member of
$F^{k-1}(E)$. We represent this information, as we did in Chapter 4, using
generalized symbols:
Figure 5.1. The CFSM for grammar $G_2$.

Clearly, $G_2$ is not SLR(k) for any k since the simple k-look-ahead sets
have strings in common for both the inadequate states regardless of the
value of k.

However, because the grammar generates only four strings, we
can easily determine by exhaustive tests that the look-ahead sets can be
reduced to those indicated as follows; i.e., that $G_2$ is LALR(1).



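[Figure: inadequate states 7 and 12 with the reduced look-ahead sets.
State 7: {d} for the $\#_5$-transition and {c} for the c-transition.
State 12: {c} for the $\#_5$-transition and {d} for the d-transition.]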



Clearly, a parser constructed using these look-ahead sets is a correct
one for this grammar. But how do we compute these look-ahead sets in
general?

For $G_2$ and many other grammars we can use the function ${}^{m}C^{k}$,
which is defined below and whose value is a set of ordered pairs of left and
right contexts. The definition requires the following two preliminary
definitions: (1) if φ is a string, then φ:m denotes the last m symbols of φ
if |φ| > m and φ otherwise, and (2) $\{(V_1, V_2)\}$ denotes the set of pairs
whose first components are in $V_1$ and whose seconds are in $V_2$.

Definition 5.1. Let $G = (V_T, V_N, S, P)$ be a CF grammar and m and k
be positive integers. Then

    ${}^{m}C^{k}(\#_p) = \{(\rho\omega{:}m,\ k{:}\beta) \in \{(V^*, V_T^*)\} \mid S \rightarrow_R^{*} \rho A \beta \text{ and production } p \text{ is } A \rightarrow \omega\}$.

Each pair in this set consists of the last m symbols of a stack string
φ and the first k symbols of a corresponding input string β, respectively,
such that the canonical form α = φβ has a characteristic string $\varphi\#_p$.
In other words, we have the ordered pairs of left and right contexts which
may surround a point in a canonical form where, during a deterministic,
left-to-right parsing, we should decide to make a reduction using production
p.

The ${}^{m}C^{k}(\#_p)$ sets play a part in the definition of "L(m)R(k)"
grammars similar to that played by the $F^k(A)$ sets in the definition of
SLR(k). The former sets can be computed in a way resembling the manner
in which the $F^k(A)$ sets are computed (recall the example on page 71),
except that, of course, corresponding left and right contexts must be tallied.
The former sets are certainly more difficult to compute than the latter,
but their computation is a reasonable next step in our parser-generating
procedure.

In the case of grammar $G_2$ we have

    ${}^{2}C^{1}(\#_5) = \{(ae, d),\ (be, c)\}$
and we can observe that any string ending in ae will not access state 12;
therefore the look-ahead set for the $\#_5$-transition from state 12 need not
contain the string d⊣. Similarly, the look-ahead set for the
$\#_5$-transition from state 7 need not contain c⊣. If we minimize the
lengths of the strings in the look-ahead sets which result after these
deletions, we arrive at the same sets deduced above.
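
For a grammar as small as $G_2$ the context pairs can be obtained by brute
force, simply by enumerating every canonical (rightmost) derivation. The
following sketch does so; it is illustrative only and is not the
computation alluded to above. It assumes one character per symbol, writes
"<" and ">" for the end markers ⊢ and ⊣, and terminates only for grammars
that generate finitely many strings. The names G2 and context_pairs are
hypothetical.

```python
from collections import deque

# Grammar G2; list position = production number.
G2 = [
    ("S", "<E>"),  # (0)
    ("E", "aAd"),  # (1)
    ("E", "aec"),  # (2)
    ("E", "bAc"),  # (3)
    ("E", "bed"),  # (4)
    ("A", "e"),    # (5)
]


def context_pairs(productions, m, k, start="S"):
    """Return {p: set of (left, right)} for positive integers m and k."""
    nonterminals = {lhs for lhs, _ in productions}
    pairs = {p: set() for p in range(len(productions))}
    seen, queue = {start}, deque([start])
    while queue:
        form = queue.popleft()
        # Position of the rightmost nonterminal of the canonical form.
        pos = max((i for i, x in enumerate(form) if x in nonterminals),
                  default=None)
        if pos is None:
            continue  # a sentence: nothing left to expand
        rho, beta = form[:pos], form[pos + 1:]
        for p, (lhs, omega) in enumerate(productions):
            if lhs != form[pos]:
                continue
            stack = rho + omega
            # Last m symbols of the stack string, first k of the input.
            pairs[p].add((stack[-m:], beta[:k]))
            successor = stack + beta
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return pairs


if __name__ == "__main__":
    print(context_pairs(G2, 2, 1)[5])
```

For m = 2 and k = 1 the set printed for production 5 consists of the two
pairs (ae, d) and (be, c), in agreement with the text.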

Example 2. Our second grammar $G_3$ is rather similar to $G_2$.
$G_3$'s productions follow, and its CFSM is illustrated in Figure 5.2.

    (0) S → ⊢ E ⊣        (4) E → b B d
    (1) E → a A d        (5) A → e
    (2) E → a B c        (6) B → e
    (3) E → b A c

Again we have a grammar which is not SLR(k), since
$F^k(A) = \{c\dashv,\ d\dashv\} = F^k(B)$ for any k > 1. In this case,
however, the conflict is not as easily removed as was that of the previous
case. If we compute the context pairs, we get

    ${}^{2}C^{1}(\#_5) = \{(ae, d),\ (be, c)\}$ and
    ${}^{2}C^{1}(\#_6) = \{(ae, c),\ (be, d)\}$.



† This example illustrates that the simple k-look-ahead sets may contain some
strings which cannot appear as the prefix of the input string β of a canonical
form α = φβ such that φ accesses the state in question; i.e., that the set
$F^k(A)$ is not sufficiently restrictive. In the current case this "causes"
the grammar not to be SLR(k). In other cases it may only cause the parser to
be slower (because it checks too many possibilities for look-ahead) and to
detect some errors somewhat later than it otherwise would. Recall the
discussion in Section 4.7.



Figure 5.2. The CFSM for grammar $G_3$.
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Analyzing this case as before we see that these context pairs imply no 
restrictions on the look-ahead sets, so we are left with the overlap: 



[Figure: inadequate state 9 with its $\#_5$- and $\#_6$-transitions, both
carrying the look-ahead set {c⊣, d⊣}.]



There is, however, a simple solution in this case, too. Note
that we could make the parsing decision associated with state 9 by looking
at both our left and right contexts after arriving there. If we look to our
left and see "ae" then, if we look to our right and see d, $\#_5$ is the
correct transition, but if we see c, $\#_6$ is correct. On the other hand,
if we see "be" to our left then the correspondences are d with $\#_6$ and
c with $\#_5$.
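
The decision just described amounts to a four-entry table indexed by the
two-symbol left context and the one-symbol right context. A minimal
sketch, with hypothetical names and with "<" and ">" standing for the end
markers ⊢ and ⊣, is the following.

```python
# (left context, right context) -> number of the production to reduce by.
DECISION = {
    ("ae", "d"): 5,
    ("ae", "c"): 6,
    ("be", "c"): 5,
    ("be", "d"): 6,
}


def choose_reduction(stack_string, remaining_input):
    """Return 5 or 6 for state 9, or None if the contexts do not match."""
    left = stack_string[-2:]      # last two symbols to our left
    right = remaining_input[:1]   # first symbol to our right
    return DECISION.get((left, right))


if __name__ == "__main__":
    print(choose_reduction("<ae", "d>"))
    print(choose_reduction("<be", "d>"))
```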

Although we could build a parser for $G_3$ which decides whether to
reduce using production 5 or 6 by looking at both left and right context, we
prefer to eliminate the special look-to-the-left for two reasons: (1) it
would be less time-efficient and also possibly less space-efficient than an
alternate approach which we give below, and (2) we can easily generalize
our other approach to cover all LR(k) grammars, but we cannot easily
generalize this one.

What we chose to do is to "build into the machine" some extra memory
for the extra left context. Note that in the case of grammar $G_2$ the machine
implicitly remembers the appropriate left context; i.e., we know that if the
machine is in state 7, the two-symbol left context is "ae", whereas if
the machine is in state 12, the context is "be". Unfortunately the CFSM
of $G_3$ forgets this context; i.e., when the machine is in state 9 the left
context may be either "ae" or "be".

We solve the problem for $G_3$ by splitting state 9 into two copies,
$9^1$ and $9^2$, as shown in Figure 5.3. Note that the look-ahead sets are
indicated and that there is no overlap. The sets may be determined (in
this case) just as they were for the CFSM of $G_2$, after the state splitting
has been performed.

5.3 L(m)R(k) Grammars

The preceding examples motivate the definition of a set of grammars
which can be described informally as those whose sentences can be parsed
by using (1) corresponding CFSMs to determine potential characteristic
strings and (2) sets of context pairs computed using ${}^{m}C^{k}$ to make
parsing decisions associated with inadequate states. Our method of defining
these grammars is similar to our method of defining the SLR(k) grammars, and
we point out the similarities as we proceed.

We first need two preliminary definitions.

Definition 5.2. Let G be a CF grammar, m be a positive
integer, and N be a state of G's CFSM. Then the set ${}^{m}L(N)$
Figure 5.3. The CFSM of grammar $G_3$ after state-splitting
and with look-ahead sets indicated via generalized symbols.
The machine is later called the L2R1FSM of $G_3$.
is the set of left contexts of length m which end strings
which access N; i.e.,

    ${}^{m}L(N) = \{\varphi{:}m \in V^* \mid \varphi \text{ accesses } N\}$.

This set can be computed by following all possible paths backwards
through the CFSM, from N, for m steps or until the starting state is
reached. Since the connectivity of the CFSM (graph) can be represented
by a bit-matrix, the computation involves some fast bit-matrix manipulations
(Pro 59).
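
The backward walk can be sketched directly on a list of transitions. Note
that the sketch below is a plain recursive graph walk and not the
bit-matrix formulation mentioned above. It assumes that every state is
reachable from the starting state and that symbols are single characters;
the function name and the state numbers other than 9 are hypothetical, the
edges being only a fragment in the spirit of the CFSM of $G_3$.

```python
def left_contexts(transitions, start, target, m):
    """Left contexts of length at most m that end strings accessing target.

    transitions -- iterable of (from_state, symbol, to_state) triples
    """
    incoming = {}
    for src, symbol, dst in transitions:
        incoming.setdefault(dst, []).append((src, symbol))

    result = set()

    def walk(state, suffix):
        if len(suffix) == m:
            result.add(suffix)       # m steps taken: stop here
            return
        if state == start:
            result.add(suffix)       # whole accessing string shorter than m
        for src, symbol in incoming.get(state, []):
            walk(src, symbol + suffix)

    walk(target, "")
    return result


# Hypothetical fragment: "<" stands for the left end marker.
EDGES = [(0, "<", 1), (1, "a", 2), (1, "b", 3), (2, "e", 9), (3, "e", 9)]

if __name__ == "__main__":
    print(left_contexts(EDGES, 0, 9, 2))
```

For this fragment and m = 2 the result is the set {ae, be}.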

Now we define some "sets of bounded-context pairs" associated with
the transitions of CFSMs. The definition is to our "L(m)R(k)" definition
what the definition (4.1) of simple k-look-ahead sets is to the SLR(k)
definition.

Definition 5.3. (Recursive on the value of k.)
Let G be a CF grammar and m and k be positive
integers. There is associated with each transition T
of G's CFSM a set of (m, k)-bounded-context pairs,
${}^{m}BC^{k}(T)$, as follows:

If T is a $\#_p$-transition from state N then

    ${}^{m}BC^{k}(T) = \{(\sigma, \mu) \in {}^{m}C^{k}(\#_p) \mid \sigma \in {}^{m}L(N)\}$.

Or if T is a transition under the terminal t from state N
to state M then

    ${}^{m}BC^{k}(T) =$
        if k = 1 then $\{(\sigma, t) \in \{(V^*, V_T^*)\} \mid \sigma \in {}^{m}L(N)\}$
        otherwise $\{(\sigma, t\mu') \in \{(V^*, V_T^*)\} \mid \sigma \in {}^{m}L(N)$ and
            $(\sigma t, \mu') \in {}^{m+1}BC^{k-1}$(some transition from M)$\}$.

As in the case of simple k-look-ahead sets:
(1) we do not define these sets of pairs for transitions under nonterminals
because our ultimate DPDAs will have no such transitions, and (2) although
for the ease of definition sets are associated with every terminal- and
#-transition of the CFSM, we are interested only in the sets for transitions
from inadequate states.

The computation of these sets of pairs for a $\#_p$-transition primarily
consists of computing ${}^{m}C^{k}(\#_p)$ and ${}^{m}L(N)$, as can be seen
from the definition. For a transition under a terminal and for k > 1, the
computation proceeds in a manner similar to that illustrated above (page 73)
for the computation of a simple k-look-ahead set, except that, of course,
corresponding left and right contexts must be tallied.

In the case of $G_3$'s CFSM and for m = 2 and k = 1, we have for
inadequate state 9:

    ${}^{2}BC^{1}$(the $\#_5$-transition) $= {}^{2}C^{1}(\#_5) = \{(ae, d),\ (be, c)\}$ and
    ${}^{2}BC^{1}$(the $\#_6$-transition) $= {}^{2}C^{1}(\#_6) = \{(ae, c),\ (be, d)\}$.

Of course, this agrees with our results above.
We now come to the main definition of this section.

Definition 5.4. Let G be a CF grammar and m and k
be positive integers. Let N be an inadequate state
(if any) of the CFSM of G. Then G is L(m)R(k) if
and only if the sets of (m, k)-bounded-context pairs
associated with the transitions from N are mutually
disjoint. Also, G is L(0)R(k), L(m)R(0), and L(0)R(0)
if and only if it is SLR(k), LR(0), and LR(0), respectively.

We include the three special cases solely for completeness; we do not discuss
them further.

Note that grammar $G_3$ is L(2)R(1) by definition, as can be seen
from the disjoint sets ${}^{2}C^{1}(\#_5)$ and ${}^{2}C^{1}(\#_6)$ above.
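
The test of Definition 5.4 for state 9 of $G_3$'s CFSM can be sketched as
follows. The sets are copied from the text; the function and variable names
are hypothetical.

```python
def bounded_context_pairs(context_pairs, left_contexts):
    """Pairs of a #-transition whose left context can access the state."""
    return {(left, right) for (left, right) in context_pairs
            if left in left_contexts}


def mutually_disjoint(sets):
    """True if no element belongs to two of the given sets."""
    seen = set()
    for s in sets:
        if seen & s:
            return False
        seen |= s
    return True


# Context pairs for productions 5 and 6, and the left contexts of state 9.
C5 = {("ae", "d"), ("be", "c")}
C6 = {("ae", "c"), ("be", "d")}
L9 = {"ae", "be"}

BC5 = bounded_context_pairs(C5, L9)
BC6 = bounded_context_pairs(C6, L9)

if __name__ == "__main__":
    # With left contexts retained, the two sets do not overlap.
    print(mutually_disjoint([BC5, BC6]))
    # With right contexts alone (look-ahead only), they do overlap.
    print(mutually_disjoint([{r for _, r in BC5}, {r for _, r in BC6}]))
```

The first check succeeds and the second fails, which is the sense in which
the grammar is L(2)R(1) although its look-ahead sets alone overlap.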

5.4 LmRkFSMs 

We now define an FSM which can be used by our modified stack-algorithm
of Section 4.3 to determine characteristic strings for an L(m)R(k)
grammar. This new machine is the CFSM modified to accept some extra
strings in which correspondence between (bounded) left and right contexts
is explicit.

Definition 5.5. Let G be an L(m)R(k) grammar. We construct
G's LmRkFSM from its CFSM as follows. For each inadequate
state N (if any) of the CFSM and for each string σ in ${}^{m}L(N)$, we
follow each path backward through the CFSM under the reverse
of σ, say to state M; from M we add a new path (of
new transitions and new states) under σ to a new
state N'; from N' there is a transition to the
accepting state under the generalized symbol $X^{R_\sigma}$
for each X-transition (X not a nonterminal) from N
such that

    $R_\sigma = \{\mu \in V_T^* \mid (\sigma, \mu)$ is in the set of (m, k)-bounded-context pairs associated with the X-transition$\}$.

This results in a non-deterministic FSM. We change
the latter to an equivalent, deterministic FSM (via
well known techniques (H&U 69)) and reduce the
result to form the LmRkFSM.

In the case of our example grammar $G_3$ the nondeterministic
FSM is shown in Figure 5.4. The reduced, deterministic version, i.e.,
the L2R1FSM, is exactly the machine shown in Figure 5.3. Thus, the
state splitting and look-ahead sets which we deduced were necessary
above have "fallen out" of our procedure.
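
The step from the nondeterministic FSM to an equivalent deterministic one is the standard subset construction. The following Python sketch illustrates it under stated assumptions: the machine has no empty moves, the function names are ours, and the small sample machine is hypothetical (it is not Figure 5.4). Generalized symbols are simply treated as opaque labels, and the reduction (state minimization) step is not shown.

```python
from collections import deque

def determinize(start, accepting, delta):
    """Subset construction for an FSM without empty moves.

    delta maps (state, symbol) to the set of possible successor states.
    """
    start_set = frozenset([start])
    dfa_delta, dfa_accepting = {}, set()
    seen, queue = {start_set}, deque([start_set])
    while queue:
        current = queue.popleft()
        if current & accepting:
            dfa_accepting.add(current)
        # Every symbol that labels a transition out of some member state.
        for symbol in {sym for (state, sym) in delta if state in current}:
            target = frozenset(
                nxt for state in current for nxt in delta.get((state, symbol), ())
            )
            dfa_delta[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return start_set, dfa_accepting, dfa_delta

def accepts(dfa, symbols):
    """Run the deterministic machine over a sequence of symbols."""
    state, accepting, delta = dfa
    for symbol in symbols:
        if (state, symbol) not in delta:
            return False
        state = delta[(state, symbol)]
    return state in accepting

# Hypothetical machine: two paths leave state 0 under the same symbol "e".
NFA_DELTA = {(0, "e"): {1, 2}, (1, "#5{d}"): {3}, (2, "#6{c}"): {3}}
DFA = determinize(0, {3}, NFA_DELTA)
```

In the deterministic result the two "e" paths are merged into a single state carrying both generalized-symbol transitions, which is the effect the construction relies on.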

Figure 5.4. The nondeterministic FSM which is an intermediate
result in the process of computing the L2R1FSM for grammar $G_3$.

Proof. We need the following preliminary result to prove that the
LmRkFSM can, in fact, be used to determine characteristic strings.
Lemma 5.1. Let G be an L(m)R(k) grammar and N be
an inadequate state (if any) of G's CFSM. Every string
$\varphi$ which accesses N also accesses a state N' of G's LmRkFSM
such that for every X-transition from N there is an X-transition
and, if X is not a nonterminal, an $X^R$-transition from N' such
that $R = R_{\varphi:m} = \{\mu \in V_T^* \mid (\varphi{:}m, \mu)$ is in the set of (m, k)-
bounded-context pairs associated with N's X-transition$\}$.
Furthermore, there are no other transitions from N'.
Proof: By construction the LmRkFSM is a reduced,
deterministic FSM which recognizes the characteristic
strings of G plus some extra strings for each inadequate
state N of G's CFSM. The extra strings are as follows.
If the string $\varphi$ accesses N, the LmRkFSM accepts the
string $\varphi X^R$ where X and R are as given above. Now,
because the machine is deterministic, any string, in
particular $\varphi$, must access a unique state, say N',
of the LmRkFSM; because both the CFSM and the LmRkFSM
accept the characteristic strings, in particular those with
prefix $\varphi$, there must be an X-transition from N' for each
such transition from N; and because the LmRkFSM accepts
the extra strings with prefix $\varphi$, state N' must have the
extra transitions given above. Furthermore, we have
accounted for all strings with prefix $\varphi$ which are accepted
by the reduced machine, so there can be no other transitions
from N'. Q.E.D.

The following two theorems serve the same purpose with respect to an 
LmRkFSM as do Theorems 4. 1 and 4. 2 with respect to an SLRkFSM. 
Theorem 5.2. Let G be an L(m)R(k) grammar and
$\alpha = \varphi\beta$ be a canonical form of G with characteristic
string $\varphi\#_p$. Then the stack string $\varphi$ of $\alpha$ accesses
a state of G's LmRkFSM which is either (1) a
reduce state whose only transition is under $\#_p$ or
(2) a state (like N' of Lemma 5.1) with transitions
under the generalized symbols
$X_1^{R_1}, X_2^{R_2}, \ldots, X_n^{R_n}$, for some $n \ge 2$, such that
$k{:}\beta$ is in $R_i$ but not in $R_j$ for $1 \le i \ne j \le n$, and such
that $X_i = \#_p$.


Proof: Our proof depends upon the similarity of the
CFSM and the LmRkFSM of G. There are only two
cases since $\varphi$ must access either a reduce state or
an inadequate state of the CFSM. (1) If it accesses
a reduce state of the CFSM, it must also access a
reduce state of the LmRkFSM, because they both are
deterministic and, although the LmRkFSM accepts more
strings than does the CFSM, the extra ones are formed
by adding symbols to the end of prefixes which access
inadequate states but not reduce states of the CFSM.
Further, the only transition from the reduce state
accessed by $\varphi$ must be under $\#_p$, since the machine
accepts $\varphi\#_p$. (2) If $\varphi$ accesses an inadequate state N
of the CFSM, it must access a state N' of the LmRkFSM
with transitions under generalized symbols, by Lemma
5.1. Consider the sets $R_1, R_2, \ldots, R_n$ which are
associated with the generalized symbols labeling
transitions from N'. These sets must be mutually
disjoint because they were derived from the mutually
disjoint sets of context pairs associated with the
transitions from N as follows: each set is the set of
right contexts which are paired with a common left
context, in particular $\varphi{:}m$, in the set of context pairs
associated with some transition from N. Thus, $k{:}\beta$
can be in at most one of the sets. Furthermore, by
Lemma 5.1 one of the generalized symbols $X_i^{R_i}$ has
$X_i = \#_p$, and $R_i$ must contain $k{:}\beta$ because it is computed
from $^mC^k(\#_p)$ which by definition (5.1) contains
$(\varphi{:}m, k{:}\beta)$. Q.E.D.

Theorem 5.3. Let G be an L(m)R(k) grammar and
$\alpha = \varphi\theta\beta$ be a canonical form of G with characteristic
string $\varphi\theta\#_p$ such that $\theta$ is in $V_T^*$ but $\theta \ne \epsilon$. Then, if
$\varphi$ accesses a state (like N' of Lemma 5.1) of G's
LmRkFSM having transitions under the generalized
symbols $X_1^{R_1}, X_2^{R_2}, \ldots, X_n^{R_n}$ for some $n \ge 2$, the
string $k{:}\theta\beta$ is in $R_i$ but not in $R_j$ for $1 \le i \ne j \le n$, such
that $X_i = 1{:}\theta$.
l 

Proof: $k{:}\theta\beta$ may appear in at most one of the sets
$R_1, R_2, \ldots, R_n$, since the sets are mutually disjoint,
as was shown in the previous proof. $k{:}\theta\beta$ must appear
in $R_i$ such that $X_i = 1{:}\theta$ for the following reasons. G's
CFSM accepts $\varphi\theta\#_p$; thus, if $\varphi$ accesses state N of the
CFSM, then there is a path leading from N which spells
out $\theta\#_p$. It is easy to see from the definition of the set
$^mBC^k$ of (m, k)-bounded-context pairs for a terminal-
transition (in particular, one under $1{:}\theta$) that if $|\theta| \ge k$
then $(\varphi{:}m, k{:}\theta)$ is in $^mBC^k$, whereas if $|\theta| = n < k$ then
every pair $(\varphi{:}m, \theta\mu')$ is in $^mBC^k$ such that $\mu'$ is in the
set $\{\mu' \in V_T^* \mid \varphi\theta$ accesses state M of the CFSM and
$(\varphi\theta{:}(m+n), \mu')$ is in the set of (m+n, k-n)-bounded-
context pairs of some transition from M$\}$. Furthermore,
$(\varphi\theta{:}(m+n), (k-n){:}\beta)$ must be in the latter set of pairs because
there must be a $\#_p$-transition from M and $^{m+n}C^{k-n}(\#_p)$
includes the former pair by definition. Finally, since
we have shown that the set of bounded-context pairs
associated with the $(1{:}\theta)$-transition from N of the CFSM
contains $(\varphi{:}m, k{:}\theta\beta)$, Lemma 5.1 implies that the $(1{:}\theta)$-
transition from N' of the LmRkFSM is such that $R_i$
contains $k{:}\theta\beta$. Q.E.D.
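
Theorems 5.2 and 5.3 together say that, at a state like N', exactly one look-ahead set contains the next k input symbols, and that set names the parsing decision. A minimal Python sketch of this selection follows; the function name and the sample state (with k = 1) are hypothetical and serve only as an illustration.

```python
def parsing_decision(transitions, lookahead):
    """Pick the unique generalized symbol whose look-ahead set contains the look-ahead.

    transitions maps each symbol X (a terminal or a reduce symbol such as "#5")
    to its look-ahead set R.
    """
    matches = [symbol for symbol, r in transitions.items() if lookahead in r]
    if len(matches) != 1:
        # Zero matches: the input is not a sentence.  Two or more matches would
        # contradict the mutual disjointness established in the proofs above.
        raise ValueError(f"no unique decision for look-ahead {lookahead!r}")
    return matches[0]

# Hypothetical state with two reductions and one read transition, k = 1.
STATE = {"#5": {"d"}, "#6": {"c"}, "e": {"e"}}
```

For this sample state, a look-ahead of "d" selects the reduction "#5" and a look-ahead of "e" selects the read transition under e.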

Summary. In review, our technique for constructing an LmRkFSM
for a CF grammar G which is L(m)R(k) is as follows. Compute the
context pairs associated with the transitions from the inadequate states
of G's CFSM. Form a nondeterministic FSM by adding to the CFSM certain
new transitions and states. The result is a nondeterministic machine which
recognizes some extra strings in which correspondences between left and
right contexts are explicit. Change the machine to an equivalent,
deterministic FSM and reduce it. Voilà! Of course, we can minimize the
lengths of the strings in the look-ahead sets here just as we did for SLRkFSMs.

It should be clear from Theorems 5. 2 and 5. 3 that LmRkFSMs can 
be used by our modified stack-algorithm just as are SLRkFSMs. It 
therefore follows that we can replace "SLRkFSM" with "LmRkFSM" 
throughout the description of our technique for converting SLRkFSMs to 
DPDAs to get the appropriate procedure for LmRkFSMs. 

It should also be clear that for a given L(m)R(k) grammar, we 
need to resort to the L(m)R(k) techniques only for inadequate states with 
overlapping simple k-look-ahead sets. To formalize this we would have 
to prove theorems similar to Theorems 5. 2 and 5. 3 stated for a machine 
having reduce states, inadequate states with simple k-look-ahead sets, 
and states like N' of Lemma 5.1. That is, the new theorems would be
a combination of Theorems 4. 1 and 4. 2, and 5.2 and 5. 3, respectively. 
We do not state and prove these theorems since the notation would get 
out of hand and since the exercise would be of little intellectual value. 

5. 5 Parsers for General LR(k) Grammars 

We now turn to the problem of constructing a parser for a general 
LR(k) grammar. That is, we want a method for covering grammars which 
are LR(k) but which are not SLR(k) or even L(m)R(k). Again we choose to 
illustrate the solution first by example and then to give the general 
solution. We do not formalize the results of this section because they
are similar to those of the previous section; however, we do include an
informal proof regarding the only significantly different feature.

Example. Consider the grammar $G_4$ (also similar to G ) whose
productions follow.

(0) S → ⊢ E ⊣      (5) A → e A
(1) E → a A d      (6) A → e
(2) E → a B c      (7) B → e B
(3) E → b A c      (8) B → e
(4) E → b B d
The corresponding CFSM is shown in Figure 5.5. It has one inadequate
state, state 9.

The grammar is not SLR(k) because $F^k(\#_6) = F^k(\#_8) = \{c\dashv, d\dashv\}$
for any k > 1. Since these sets overlap for all k, we need not bother
to compute the simple k-look-ahead set $L^k$ for the e-transition; however,
we do so for the value as an example:

$$
\begin{aligned}
L^k &= \{e\beta \in V_T^* \mid \beta \text{ is in a simple } (k-1)\text{-look-ahead set associated with a transition from state 9}\} \\
    &= \{e\beta \mid \beta \text{ is in } \{c\dashv, d\dashv\} \cup L^{k-1}\} \\
    &= \{ec\dashv, ed\dashv\} \cup eL^{k-1} \\
    &= \{ec\dashv, ed\dashv, eec\dashv, eed\dashv, \ldots, e^{k-2}c\dashv, e^{k-2}d\dashv, e^{k-1}c, e^{k-1}d, e^k\}
\end{aligned}
$$

for k > 2. Obviously, this adds no new overlaps. Thus, the parsing
decision associated with state 9 about whether to read or reduce can be 
made on the basis of one-symbol look-ahead ({e} and {c, d} are the 
respective look-ahead sets); but the decision as to which reduction to make 
cannot be determined via look-ahead alone, even if we look all the way to 
the end of the string. Having discovered this, we need not discuss the 
e-transition further below, although we do so, again for exemplary value. 
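
The overlap that rules out SLR(1) parsing can be checked mechanically. The Python sketch below computes one-symbol follow sets for the productions listed above. It rests on several assumptions of ours: the grammar has no empty productions, the usual FOLLOW set of a production's left part is taken to stand in for $F^1(\#_p)$, the end markers are written in ASCII as "|-" and "-|", and all names are hypothetical.

```python
# Productions (0)-(8) of the example grammar, keyed by left part.
GRAMMAR = {
    "S": [["|-", "E", "-|"]],
    "E": [["a", "A", "d"], ["a", "B", "c"], ["b", "A", "c"], ["b", "B", "d"]],
    "A": [["e", "A"], ["e"]],
    "B": [["e", "B"], ["e"]],
}

def first_sets(grammar):
    """FIRST sets for a grammar without empty productions (fixed-point iteration)."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                head = rhs[0]
                new = first[head] if head in grammar else {head}
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

def follow_sets(grammar):
    """One-symbol FOLLOW sets for a grammar without empty productions."""
    first = first_sets(grammar)
    follow = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                for i, symbol in enumerate(rhs):
                    if symbol not in grammar:
                        continue  # terminals have no follow set
                    if i + 1 < len(rhs):
                        nxt = rhs[i + 1]
                        new = first[nxt] if nxt in grammar else {nxt}
                    else:
                        new = follow[nt]  # symbol ends the right part
                    if not new <= follow[symbol]:
                        follow[symbol] |= new
                        changed = True
    return follow

FOLLOW = follow_sets(GRAMMAR)
```

Under these assumptions FOLLOW["A"] and FOLLOW["B"] are both the set containing c and d, so the reductions by productions (6) and (8) cannot be told apart by look-ahead alone, while neither set contains e, so the read-or-reduce decision needs only one symbol.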
Figure 5.5. The CFSM for grammar $G_4$.




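Neither is the grammar L(m)R(k). Because it is small it is easy
to compute by hand the context pairs for the transitions from state 9, which
are as follows for m > 2 and k > 2:

$\#_6$-transition:
$$
\begin{aligned}
\{\ &(\vdash ae, d\dashv),\ (\vdash be, c\dashv), \\
&(\vdash aee, d\dashv),\ (\vdash bee, c\dashv), \\
&\ldots, \\
&(\vdash ae^{m-2}, d\dashv),\ (\vdash be^{m-2}, c\dashv), \\
&(ae^{m-1}, d\dashv),\ (be^{m-1}, c\dashv), \\
&(e^m, d\dashv),\ (e^m, c\dashv)\ \}
\end{aligned}
$$

$\#_8$-transition:
$$
\begin{aligned}
\{\ &(\vdash ae, c\dashv),\ (\vdash be, d\dashv), \\
&(\vdash aee, c\dashv),\ (\vdash bee, d\dashv), \\
&\ldots, \\
&(\vdash ae^{m-2}, c\dashv),\ (\vdash be^{m-2}, d\dashv), \\
&(ae^{m-1}, c\dashv),\ (be^{m-1}, d\dashv), \\
&(e^m, c\dashv),\ (e^m, d\dashv)\ \}
\end{aligned}
$$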



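e-transition:
$$
\begin{aligned}
&\{(\{\vdash ae, \vdash aee, \ldots, \vdash ae^{m-2}, ae^{m-1}, e^m\},\ \{ed\dashv, eed\dashv, \ldots, e^{k-2}d\dashv, e^{k-1}d, e^k\})\} \\
\cup\ &\{(\{\vdash be, \vdash bee, \ldots, \vdash be^{m-2}, be^{m-1}, e^m\},\ \{ec\dashv, eec\dashv, \ldots, e^{k-2}c\dashv, e^{k-1}c, e^k\})\}
\end{aligned}
$$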



where the notation pairing a list of left contexts with a list of right
contexts is to be understood as was $\{(V, V_T)\}$
above. Because the context pairs $(e^m, c\dashv)$ and $(e^m, d\dashv)$ appear in both the
sets associated with the $\#_6$- and $\#_8$-transitions, the grammar is not
L(m)R(k), and our informal solution of looking at a finite amount of
both left and right context to make the parsing decision associated with
state 9 will not work here. The problem is, of course, that the left
context in which we are interested (the a or b) may be arbitrarily far
to our left.†

The essential reason that we shall be able to solve this problem
is that, although the context of interest can appear arbitrarily far to the
left at the time we need it, the states and transitions of the CFSM which
are involved in reading that context are only a finite distance from the
inadequate state (since the CFSM is a finite machine). Our solution
again involves state-splitting, but this time to get the machine to remember
extra context which may be arbitrarily far to the left.

For instance, the CFSM of Figure 5.5 must have state 9 split into
two copies so it will remember whether an a or b is to its left. The
appropriate FSM is shown in Figure 5.6. Note that because of space
limitations we have drawn the FSM in the abbreviated form. Because grammar
$G_4$ is small the reader should easily be able to convince himself that this is
the appropriate FSM. Note that no more than one-symbol look-ahead is
necessary; therefore the grammar is LR(1).

† In the case of Grammar G_ the CFSM is obliging enough to remember the a
or b for us. The difference seems to be that for G_ the a or b has no
implication about the symbols in the right context, but only about how they
should be parsed, whereas with G there is a correspondence between left
and right symbols. We see no general way of discovering such complexities
in a grammar except by trying to generate a parser for it.

Figure 5.6. The CFSM of grammar $G_4$ after state-splitting and
with look-ahead sets indicated via generalized symbols; i.e.,
the (abbreviated) LR1FSM for $G_4$.

LRkFSMs: Now, for the general case we have two questions
confronting us: (1) how do we compute the necessary state-splitting, and
(2) how do we compute the look-ahead sets? The answers to these two
questions are rather similar to those for L(m)R(k) grammars. We answer
these questions next, continuing to use $G_4$ as an example, and we justify
our answers afterwards.

In the general case the left context which must be remembered may 
be anywhere to our left, thus we must search for it all the way back to the 
beginning of the string. In terms of the CFSM this means all the way back 
to the starting state. The procedure for a general LR(k) grammar G whose 
CFSM has an inadequate state N goes as follows. 

We first find the set of strings $LL(N) = \{\varphi \in V^* \mid \varphi$ accesses N via
a path through the CFSM which contains no more than k instances of any
given cycle of states$\}$. Because our CFSM can be represented via a
directed graph some of the results of graph theory are appropriate for
use in computing such paths and the corresponding strings. In fact, well
known, even fast, techniques exist for doing just that (Pro 59) (War 62).

In the case of our LR(1) grammar $G_4$ the strings are
$\vdash ae$, $\vdash aee$, $\vdash be$, and $\vdash bee$, and they correspond to paths which can be
represented by the sequences of state names: 0, 1, 4, 9 (no cycles);
0, 1, 4, 9, 9 (one cycle, 9 to 9); 0, 1, 12, 9; and 0, 1, 12, 9, 9,
respectively, none of which contain more than k = 1 instance of the only
cycle in $G_4$'s CFSM.
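
The enumeration of LL(N) can be sketched as a bounded depth-first search. The Python below is an illustration under stated assumptions, not the graph-theoretic techniques cited above: the transition table is reconstructed from the state sequences just quoted (it is not copied from Figure 5.5 and contains only the transitions on the way to state 9), the end marker is written in ASCII as "|-", and the bound "no more than k instances of any cycle" is simplified to "no state is entered more than k + 1 times", which agrees with it when the cycles are simple loops such as the one at state 9.

```python
# Partial transition table, reconstructed from the quoted state sequences.
DELTA = {
    (0, "|-"): 1,
    (1, "a"): 4,
    (1, "b"): 12,
    (4, "e"): 9,
    (12, "e"): 9,
    (9, "e"): 9,
}

def accessing_strings(delta, start, target, k):
    """Strings spelled by paths from start to target entering no state more than k + 1 times."""
    results = []

    def walk(state, symbols, visits):
        if state == target:
            results.append("".join(symbols))
        for (source, symbol), destination in delta.items():
            if source != state:
                continue
            if visits.get(destination, 0) > k:
                continue  # destination has already been entered k + 1 times
            visits[destination] = visits.get(destination, 0) + 1
            walk(destination, symbols + [symbol], visits)
            visits[destination] -= 1

    walk(start, [], {start: 1})
    return sorted(results)

LL_9 = accessing_strings(DELTA, 0, 9, 1)
```

With k = 1 the search yields the four strings named in the text; raising k to 2 adds the two strings with three e's.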

Next we compute the set $LC(N) = \{\varphi X^{R_{\varphi,X}} \mid \varphi$ is in LL(N) and
$X^{R_{\varphi,X}}$ is a generalized symbol such that there is an X-transition (X not
a nonterminal) from N and $R_{\varphi,X} = \{(k{:}\theta\beta) \in V_T^* \mid \varphi\theta\beta$ is a canonical form
with characteristic string $\varphi\theta\#_p$, and $X = (1{:}\theta\#_p)\}\}$.

Each generalized symbol $X^{R_{\varphi,X}}$ represents the set of terminal
strings of length k which may follow $\varphi$ in a canonical form $\alpha = \varphi\beta$ such that
the characteristic string of $\alpha$ accesses state N and then takes the X-transition;
i.e., it is the look-ahead set corresponding to $\varphi$ and the parsing decision
associated with the X-transition. We reference a method for computing
these sets below.

For $G_4$ the set of such $\varphi X^{R_{\varphi,X}}$ for k = 1 is:

$$
\begin{aligned}
\{\ &\vdash ae\,\#_6^{\{d\}},\ \vdash aee\,\#_6^{\{d\}},\ \vdash be\,\#_6^{\{c\}},\ \vdash bee\,\#_6^{\{c\}}, \\
&\vdash ae\,\#_8^{\{c\}},\ \vdash aee\,\#_8^{\{c\}},\ \vdash be\,\#_8^{\{d\}},\ \vdash bee\,\#_8^{\{d\}}, \\
&\vdash ae\,e^{\{e\}},\ \vdash aee\,e^{\{e\}},\ \vdash be\,e^{\{e\}},\ \vdash bee\,e^{\{e\}}\ \}
\end{aligned}
$$

as the reader may compute for himself.
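
For a grammar as small as the example, the look-ahead sets for k = 1 can also be checked by brute force. The Python sketch below is specialized to that grammar and is not the method referenced later for the general case. Its assumptions are ours: every sentence has the form ⊢ x e^n y ⊣ with x in {a, b} and y in {c, d}; the REDUCTION table, read off productions (1) to (4), (6) and (8), says which reduction applies to the last e; the end marker is written "|-"; and all names are hypothetical.

```python
from collections import defaultdict

# Reduction applied to the last e, by the first and last terminal of E's right part.
REDUCTION = {("a", "d"): "#6", ("b", "c"): "#6", ("a", "c"): "#8", ("b", "d"): "#8"}

def lookahead_sets(left_contexts, max_e):
    """Map (phi, X) to its one-symbol look-ahead set by enumerating sentences.

    Sentences |- x e^n y -| with 1 <= n <= max_e are scanned; phi ranges over
    the given left contexts only.
    """
    sets = defaultdict(set)
    for (x, y), reduction in REDUCTION.items():
        for n in range(1, max_e + 1):
            for j in range(1, n + 1):
                phi = "|-" + x + "e" * j
                if phi not in left_contexts:
                    continue
                if j < n:
                    sets[(phi, "e")].add("e")      # another e is read next
                else:
                    sets[(phi, reduction)].add(y)  # the last e is reduced
    return dict(sets)

LL_9 = ["|-ae", "|-aee", "|-be", "|-bee"]
LC_9 = lookahead_sets(LL_9, max_e=3)
```

Under these assumptions the twelve entries of LC_9 agree with the twelve strings listed above, for instance ("|-ae", "#6") maps to the set containing d.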
We now form a nondeterministic FSM in a manner similar to that
for an L(m)R(k) grammar. For each string $\varphi X^{R_{\varphi,X}}$ in LC(N) we add to
the starting state of the CFSM a new path (of new transitions and new
states) under $\varphi X^{R_{\varphi,X}}$ leading to the accepting state. We convert the
machine to a deterministic device, reduce it, minimize the strings in the
look-ahead sets, and presto: the appropriate FSM with look-ahead
sets built in; i.e., the "LRkFSM".

To specify the procedure fully we must provide two things: (1)
a procedure for computing the look-ahead sets implicit in the generalized
symbols $X^{R_{\varphi,X}}$, and (2) the reason why we need to consider only such
left contexts (the $\varphi$'s) as take paths through the CFSM which contain no more
than k occurrences of a given cycle.

Regarding the first point, we use the simple expedient of a reference.
Knuth (Knu 65,866 especially page 617) has already solved this problem.
His parsing algorithm in a sense computes the states of our CFSM dynamically,
as it is parsing a string. However, it also computes much more information,
all of which is bundled neatly into what are called "state sets". If we
simply apply his algorithm to each string $\varphi$, we can deduce the look-ahead
sets from the "state set" computed just after the algorithm has read the
last symbol of $\varphi$, as he describes in "Step 2" of the algorithm. (His set
"Z" is the look-ahead set for all transitions under terminals, and the
set "$Z_p$" is the look-ahead set for a $\#_p$-transition.)
Regarding the second point, we provide an informal proof. Recall
the canonical derivation of a form $\alpha$ of CF grammar G illustrated on page
42. Let $\alpha = \varphi\beta$ where $\varphi = \omega_0\omega_1 \ldots \omega_m\gamma$ and $\beta = \gamma''\omega_m'' \ldots \omega_1''\omega_0''$ for some
$\gamma$ and $\gamma''$ such that $\gamma\gamma' = \omega$ and $\gamma' \rightarrow^* \gamma'' \in V_T^*$. Note the correspondence between
left and right contexts: each $\omega_i$ (or $\gamma$) has a matching $\omega_i''$ (or $\gamma''$). In the
next two paragraphs we investigate the implications of this correspondence
with regard to the computation of look-ahead strings corresponding to a
particular $\varphi$ which accesses an inadequate state N of G's CFSM.

We consider first a string $\varphi$ spelled out by a path through the CFSM
which accesses a given cycle only once. That is, $\varphi$ first accesses a state
in the cycle, then goes around the cycle several, say r, times, and then
accesses state N. In this case $\varphi$ can be written

$$\omega_0\omega_1 \ldots \omega_i(\omega_{i+1} \ldots \omega_{i+n})^r\omega_{i+nr+1} \ldots \omega_m\gamma.$$

The subexpression $(\ldots)^r$ cannot include only a part of an $\omega_i$. Since r can
have any nonnegative integral value and since there are only a finite number
of productions, the canonical derivation must also have a cycle in it; i.e.,
we can write the numbers of the productions used in the derivation
$P_1P_2 \ldots P_i(P_{i+1} \ldots P_{i+n})^rP_{i+nr+1} \ldots P_m$. But each application of a
production in this sequence adds a whole $\omega_i$ to the left context, never a
part of one. Thus, the first k symbols of the right context can be written

$$k{:}\beta = k{:}\gamma''\omega_m'' \ldots \omega_{i+nr+1}''(\omega_{i+n}'' \ldots \omega_{i+1}'')^r\omega_i'' \ldots \omega_1''\omega_0''.$$

But if r > k, this is equivalent to $k{:}\ldots(\ldots)^k\ldots$, since in the worst case
where $|\gamma''\omega_m'' \ldots \omega_{i+nr+1}''| = 0$ and $|\omega_{i+n}'' \ldots \omega_{i+1}''| = 1$ we have
$k{:}\beta = (\omega_{i+n}'' \ldots \omega_{i+1}'')^k$. The point is that the look-ahead strings for any $\varphi$
which accesses a cycle r times in succession, where r > k, are exactly
the same as those for a similar $\varphi$ which accesses the cycle only k times
in succession.

We must now consider the case where $\varphi$ accesses a cycle once,
goes around it several, say $r_1$, times, wanders around the machine
elsewhere, returns, goes around the cycle several more, say $r_2$, times,
etc. Because the notation gets out of hand otherwise, we shall argue the
case for only two separate accesses of the cycle and let the reader
generalize for himself. In this case $\beta$ can be written

$$\gamma''\omega_m'' \ldots \omega_{i_2+n_2r_2+1}''(\omega_{i_2+n_2}'' \ldots \omega_{i_2+1}'')^{r_2}\omega_{i_2}'' \ldots \omega_{i_1+n_1r_1+1}''(\omega_{i_1+n_1}'' \ldots \omega_{i_1+1}'')^{r_1}\omega_{i_1}'' \ldots \omega_1''\omega_0''$$

for $i_2 \ge i_1 + n_1r_1$. In the worst
case where $|\gamma''\omega_m'' \ldots \omega_{i_2+n_2r_2+1}''| = 0$ and $|\omega_{i_2}'' \ldots \omega_{i_1+n_1r_1+1}''| = 0$ and
$|\omega_{i_2+n_2}'' \ldots \omega_{i_2+1}''| = 1$ and $|\omega_{i_1+n_1}'' \ldots \omega_{i_1+1}''| = 1$,
we see that if $r_1 \ge k$ then $k{:}\beta = (\omega_{i_2+n_2}'' \ldots \omega_{i_2+1}'')^{r_2}(\omega_{i_1+n_1}'' \ldots \omega_{i_1+1}'')^{r'}$
where r' is the
maximum of $k - r_2$ and zero. Thus, if $r_2 \ge k$ the look-ahead strings for
$\varphi$ are the same as for a similar $\varphi$ but with $r_2 = k$ and $r_1 = 0$. However,
if $0 \le r_2 < k$, they are the same as the ones for a similar $\varphi$ but with
$r_1 = k - r_2$.

In conclusion (and generalizing), the look-ahead strings for a given
$\varphi$ whose path through the CFSM goes around a given cycle a total of r > k
times are the same as those for a similar $\varphi$ with only k such loops.†
Therefore, our procedure above which computes look-ahead sets by considering
only the $\varphi$'s with no more than k such loops computes all possible look-ahead
strings.

Conclusion . Here, as above, it seems clear that these more 
complex techniques need be applied only to inadequate states for which our 
simpler techniques will not work. It is also clear that the procedure for 
converting an LRkFSM to a DPDA is the same as that for an LmRkFSM. 

What we have not provided thus far, however, is a method which
is convenient to use in the above procedure for deciding if a given grammar
is LR(k). It should be clear from our informal proof above and the definition
(2.2) of LR(k) grammars that a CF grammar G is LR(k) if and only if, for
each inadequate state N of G's CFSM and each string $\varphi$ in LL(N), the set
$\{R_{\varphi,X} \mid$ there is an X-transition (X not a nonterminal) from N$\}$ is a set
of mutually disjoint sets. This, of course, means that the look-ahead sets
for each inadequate state of the LRkFSM are mutually disjoint.

† Actually we could do better than this. If all the cycles were "separate"
from each other in the CFSM, we could consider only $\varphi$'s with a total of
k loops around any cycle. Unfortunately our proof would get excessively
complicated to cover the case where one cycle is a part of another. We are
satisfied with the above simple, sufficient condition because our purpose here
is to show that the task of computing the look-ahead sets is a finite one, not
to develop a method for computing the sets which requires a minimum of time.
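
The LR(k) test just stated, applied to state 9 of the example with k = 1, can be sketched in Python as follows. The look-ahead table is copied from the set listed for the example grammar; the function names, the ASCII end marker "|-", and the tuple encoding of each entry are our assumptions. The sketch also shows why the per-string grouping matters: once the left contexts are merged, the two reductions share look-ahead symbols.

```python
from itertools import combinations

# (phi, X) -> look-ahead set for k = 1 at state 9 of the example grammar.
LOOKAHEAD = {
    ("|-ae", "#6"): {"d"}, ("|-aee", "#6"): {"d"},
    ("|-be", "#6"): {"c"}, ("|-bee", "#6"): {"c"},
    ("|-ae", "#8"): {"c"}, ("|-aee", "#8"): {"c"},
    ("|-be", "#8"): {"d"}, ("|-bee", "#8"): {"d"},
    ("|-ae", "e"): {"e"}, ("|-aee", "e"): {"e"},
    ("|-be", "e"): {"e"}, ("|-bee", "e"): {"e"},
}

def mutually_disjoint(sets):
    """True when no two of the given sets share an element."""
    return all(not (a & b) for a, b in combinations(sets, 2))

def passes_lr_test(lookahead):
    """Check disjointness separately for each left context phi."""
    by_phi = {}
    for (phi, _symbol), r in lookahead.items():
        by_phi.setdefault(phi, []).append(r)
    return all(mutually_disjoint(sets) for sets in by_phi.values())

def merged_by_transition(lookahead):
    """Forget phi: union the look-ahead sets of each transition."""
    merged = {}
    for (_phi, symbol), r in lookahead.items():
        merged.setdefault(symbol, set()).update(r)
    return merged
```

For this table passes_lr_test succeeds, whereas the merged sets for "#6" and "#8" are both {c, d}, which is the overlap that made the grammar fail the simpler tests.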
5. 6 Comments 

We noted above that Knuth's LR(k) parsing algorithm in a sense
computes the states of our CFSM dynamically, as it parses a string.
Actually, we believe this to be accurate only in the case k = 0. If k ≥ 1,
Knuth's algorithm computes the states of a machine much larger than our
CFSM. In effect, the processes of splitting states and computing
look-ahead sets are bound together in his algorithm. Consequently, for k = 1
the number of states computed is the number of states of the CFSM times
some number having to do with the number of symbols which appear in
the look-ahead sets. In practical cases this multiplicative factor is
impractically large (Kor 69). Further, the size of the machine increases
rapidly with increasing k.

Korenjak (Kor 69) noticed that the multiplicative factor depends upon
the size of the look-ahead sets, and he proposes a parser-construction
technique to reduce the effect. He proposes that the grammar be partitioned
into several sub-grammars, that a sub-parser be generated for each
sub-grammar by using Knuth's algorithm for each, and that the desired
parser be constructed by combining the sub-parsers appropriately. Since
the look-ahead sets for each sub-grammar are much smaller than those
for the entire grammar, the multiplicative factor for each sub-parser is
much smaller than that for a parser constructed directly for the entire
grammar. Further, a relatively small number of extra states are required
to combine the sub-parsers.

In a sense, we have taken Korenjak's approach to the extreme by
analyzing the grammar production-by-production; or more precisely, we
analyze the CFSM inadequate-state-by-inadequate-state. Our method
seems to cause nearly a minimum of state-splitting and look-ahead. We
leave as questions for future research, however, whether or not it does
cause such minimums, and if not, how it could be modified to do so.
Chapter 6
TRANSLATORS 

6. 1 Philosophy 

Thus far we have followed the lead of Knuth, concerning ourselves 
solely with grammatical analysis. However, our interest is ultimately 
in translators rather than parsers. We have addressed the parsing 
problem first because it gives us a convenient basis from which to address 
translation, a fact which will become abundantly clear below when we see 
that our method of specifying translations is based directly on CF grammars. 
It will follow that our translators can be based directly on our parsers. 
We now, therefore, abandon the grammatical analysis approach 
and adopt the philosophy of Lewis and Stearns (L&S 68), namely that 
"implementing a translation should be regarded as an 
automata theory problem of machine capability and 
efficiency rather than as a problem of grammatical 
analysis. " 
We deal only with the capabilities of DPDAs here, so our main concern is in 
improving efficiency by making transformations on our machines which 
preserve their input/output relations. Of course our yen to perform 
transformations must be tempered by the implications of our desire to 
implement the translators ultimately on a modern digital computer. 
Actually, we have already been abiding by part of this philosophy,
in preparation for the material in this chapter. In effect, we have
regarded parsers as translators which translate sentences into parses,
i.e., into strings of productions or production numbers. Although we
found it convenient to discuss grammatical analysis at first from the
string-manipulation viewpoint, we certainly made it a point to convert
to the automata-theory viewpoint when we converted our
string-manipulation parsers to DPDAs.

6. 2 Objective 

It is the objective of this chapter to show how our results are 
relevant to (a) the specification of translations of programming languages, 
and (b) the construction of compilers from those specifications. 

In Section 6.3 we motivate an interest in string-to-string translators
similar to our DPDA-parsers. We do so by discussing some well-known
approaches to compiler construction.

In Section 6.4 we show why we are not interested in parses, per
se. We motivate an interest in string-to-tree translators, each of which
can be regarded as a concatenation of two subtranslators: the first being
a string-to-string translator which maps input strings into strings (sequences)
of tree-building directives, and the second being a string-to-tree translator
which maps strings of directives into trees (by obeying the directives).
In Section 6.5 we present a formalism based on CF grammars for
specifying string-to-string translations, and in 6.6 we show how to
convert our DPDA-parsers to corresponding translators. The latter
feat is trivial, but some nontrivial optimizations ensue. We formalize
only string-to-string translators because our (linear) automata-theoretic
approach seems inappropriate when discussing trees.

Finally, in Section 6.7 we present a compiler model, and in 6.8
we show the relevance of our results to the specification of languages,
translations, and compilers; i.e., to TWSs.

We emphasize that the only formal results in the present chapter 
are those of Sections 6. 5 and 6. 6. The remainder of the chapter is 
intended as motivation for those two sections and discussion of their 
relevance to TWSs. 

6. 3 Syntax Directed Compilers 

Many compilers in existence today, whether written by hand or
partially or wholly written by a TWS, are termed "syntax directed"†
compilers. The approach of Cheatham (Che 67) is fairly representative
for our purposes here. He advocates the use of "augments" to productions
to enhance the descriptive power of CF grammars so they can be used to
specify programming languages fully††. These "augments" are in the
form of "actions", "conditions", and "interpretations" associated with
the productions. He envisions a parser as an "engine" operating in an
"environment". As the parser parses a string it drives other mechanisms
which (a) execute "actions", thus causing the "environment" to change, (b)
check "conditions" in the "environment", thus providing context-sensitivity,
and (c) compute "interpretations" ("values", "meanings", or "semantics").
The auxiliary mechanisms are activated each time the parser makes a
reduction, and they then compute the "augments" associated with the
corresponding production. When the parser has finished parsing the input
string, intermediate object code has been output via "actions" and any
relevant tables are available via "interpretations" associated with the
entire program.

† In the sense we have in mind here the term should perhaps be
"syntax-analyzer directed".

†† What amounts to a generalization and a formalization of this approach can
be found in (Knu 66).
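
The arrangement in which each reduction activates the routine attached to its production can be sketched in Python as follows. This is only an illustration in the spirit of the description above, not Cheatham's system: the production numbering, the emitted instructions, and all names are hypothetical, and the parse is supplied as a ready-made sequence of production numbers rather than produced by a parser.

```python
def run_augments(parse, augments, environment):
    """Activate the routine attached to each production, in reduction order."""
    for production in parse:
        routine = augments.get(production)
        if routine is not None:
            routine(environment)  # an "action" may change the environment
    return environment

# Hypothetical numbering: (1) E -> E + T, (2) E -> T, (3) T -> i.
AUGMENTS = {
    1: lambda env: env["code"].append("ADD"),
    3: lambda env: env["code"].append("LOAD i"),
}

# Bottom-up parse of the string i + i under the numbering above.
PARSE_OF_I_PLUS_I = [3, 2, 3, 1]

RESULT = run_augments(PARSE_OF_I_PLUS_I, AUGMENTS, {"code": []})
```

Production (2) has no routine attached, so its reduction leaves the environment unchanged; the emitted sequence is two loads followed by an add.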

A basically similar approach is one due to Feldman (Fel 64) in 
which "EXEC n" routines are associated with "Floyd-Evans productions" 
(Eva 65) comprising a parsing program. Roughly speaking, the "EXEC 
n" routines are the analogues to Cheatham's "augments". 

An approach similar to one or the other of these two, or similar
to our own approach (described below) where an "abstract syntax tree"
or "parse tree" is built, is used in every compiler or TWS effort
described in (F&G 68). Implicit in and fundamental to the compilers of
all these schemes is a string-to-string translation: a translation from
the input string to a string (sequence) of commands to mechanisms to
compute "augments", or of calls to "EXEC n" routines, or "semantic
routines", or "generators", etc. Thus, if the reader is partial to one
of these schemes in particular, he may think of the "output symbols"
below as the appropriate commands or calls to routines, and he may
think of our "CFSTs" as the corresponding string-to-string translators.
For our purposes we think of the "output symbols" as tree-building
directives, as we discuss next.

6. 4 Abstract Syntax Trees 

In previous chapters we devoted much time to the development of 
DPDA-parsers; i. e. , string-to-string translators which map input strings 
into parses. In the present section we discuss the reasons why parses, 
as such, are not as appropriate for purposes of compiling as are the strings 
of tree-building directives referred to above. 

Inefficient coding. There are two problems with parses, per se: 
(1) they contain some information in which we are not interested, and (2) 
the information which we do desire is not explicit. 

For instance, for grammar G the string $\vdash i + i \dashv$ can be reduced
to $\vdash E + T \dashv \rightarrow\ \vdash E \dashv \rightarrow S$. But for purposes of compiling we do not care
that the reductions for the first i were i → P → T → E and the second were
i → P → T, nor do we care which reductions were made first or which
particular nonterminals were used. The only information which is both
implicit in the parse and of interest to us is that one i is the left operand
of the operator + and that the other is the right operand. If we were
mathematically inclined, we might represent this information via a 
functional form; e. g. , + (i, i) or PLUS (i, i). However, for purposes of 
discussing compiling activities and for an explicit representation of the 
"structure" which is implicit in parses, we find it more convenient to 
represent the above information via the following graph (tree):

        +
       / \
      i   i

Such a graph, representing the "structural" information or relationships 
which are implicit in a parse, has been called by some an "abstract 
syntax tree" (W&E 69, Lan 66, McC 66). We elaborate on the reasons 
for this name in Section 6. 7. 
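
A translator which obeys a string of tree-building directives can be sketched in Python as follows. The directive format is an assumption of ours, chosen for illustration: directives arrive in postfix order, a directive with arity n builds a node from the n most recently built trees, and any directive not listed in the arity table builds a leaf.

```python
def build_tree(directives, arity):
    """Obey postfix tree-building directives and return the single resulting tree.

    A tree is a pair (label, children); a leaf has an empty tuple of children.
    """
    stack = []
    for directive in directives:
        n = arity.get(directive, 0)
        if n > len(stack):
            raise ValueError(f"directive {directive!r} needs {n} subtrees")
        children = stack[len(stack) - n:]  # the n most recently built trees
        del stack[len(stack) - n:]
        stack.append((directive, tuple(children)))
    if len(stack) != 1:
        raise ValueError("directives do not describe exactly one tree")
    return stack[0]

# The tree with root + and operands i, i, from the directives i i +.
TREE = build_tree(["i", "i", "+"], {"+": 2})
```

The first subtranslator would emit the directive string while parsing; this second stage never sees the parse itself, only the directives.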

Now, (a) if we are not interested in all the information implicit 
in a parse, it would be inefficient for our compiler to generate it. Further, 
(b) if an abstract syntax tree represents all and only the information 
implicit in the parse which is of interest for further compiling activities 
and (c) if the tree can be represented in some convenient and useful way 
in a computer, then our results would be more useful if we could show 
(1) how to specify a translation from strings to trees in a manner based 
on CF grammars and (2) how to convert our parsers to efficient translators 
which effect the corresponding string-to-tree translations.

Conceptual modularity. Although if-clauses (a) and (b) of the
preceding paragraph probably represent good assumptions, (c) is 
subject to some question, partly because it is not clear that the 
abstract syntax tree, per se, ever needs to be built during the compiling 
process. But we do not let this stop us for the following reason: even 
if our compiler does not actually construct an abstract syntax tree, we 
can regard the process conceptually as building it. 

We argue that even the string-to-string translator which we
develop below can be regarded as the concatenation of two subtranslators,
the first being a parser and the second effecting a translation from parses
to the desired strings. However, after we have thoroughly investigated
the two subtranslators, we see that they can easily be combined so as to save
us actually having to generate the parse.

Similarly, we can regard preliminary compiling activities as 
performing a translation from input string to abstract syntax tree and 
subsequent activities as performing a translation, again conceptually 
composed of several subtranslations, from abstract syntax tree to object 
code.† The advantage of this approach relative to a less modular one is,
of course, that the otherwise complex task of compiling is broken into several
relatively simple subtranslations. Hopefully, when we are finished analyzing
the subtranslators separately, we will be able to see how to put them
together in such a way as to minimize redundancies. This may mean
that the abstract syntax tree, per se, need never actually be constructed.

† This approach was largely inspired by (W&E 69) which in turn was based
on (Lan 66). See Section 6.7.

Example. As an example of what we mean by "tree-building
directives", consider the following. If our string-to-string translator
maps the example string ⊢ i + i ⊣ above into the string i i +, we can
regard the latter as the following sequence of directives: build a terminal
node with name i; build another terminal node with name i; build a
nonterminal node with name +, with right (or second) son the last node
built, and with left (or first) son the next-to-the-last node built.

In general, if our tree builder is always to construct nonterminal
nodes whose sons are the last few nodes constructed, and in the same
order, the sequence of directives must be a linear representation of the
tree which is commonly called a "suffix form". (See (Che 67) for a
thorough discussion of the correspondences between trees and their
linear representations.) Further, the device can keep track of the nodes
it has built by maintaining a push-down stack of pointers to them, and
the pushing and popping of this stack will occur in a sequence closely
corresponding to that of the stack of our DPDA translator which issues
the directives. Our compiler model and another example below should shed
more light on this subject.
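
As an illustration only, the interpretation of a suffix form as a sequence of tree-building directives can be sketched in Python. The fragment is not part of the original development: the names Node, ARITY, build_tree, and functional_form are hypothetical, the caret ^ is an ASCII stand-in for the exponentiation operator, and we assume that any directive not listed in ARITY builds a terminal node.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    sons: list = field(default_factory=list)


# Hypothetical table: number of sons consumed by each nonterminal-node directive.
# Any directive not listed here is assumed to build a terminal node.
ARITY = {"+": 2, "^": 2}


def build_tree(directives):
    """Interpret a suffix-form sequence of directives with a push-down stack."""
    stack = []  # references to the top nodes of the pieces built so far
    for d in directives:
        n = ARITY.get(d, 0)
        if n == 0:
            stack.append(Node(d))  # build a terminal node
            continue
        if len(stack) < n:
            raise ValueError("too few nodes for directive " + d)
        sons = stack[-n:]  # the last n nodes built, in their original order
        del stack[-n:]
        stack.append(Node(d, sons))  # build a nonterminal node
    if len(stack) != 1:
        raise ValueError("directives do not describe a single tree")
    return stack[0]


def functional_form(node):
    """Render a tree in functional notation, e.g. +(i, i)."""
    if not node.sons:
        return node.name
    return node.name + "(" + ", ".join(functional_form(s) for s in node.sons) + ")"
```

Applied to the directives i, i, +, the sketch builds a node named + whose first son is the next-to-the-last node built and whose second son is the last node built, in agreement with the example.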

6.5 Transduction Grammars, Translations

We now get down to business. As our method of specifying string-to-string
translations we choose a technique which is based on CF grammars
and which fits naturally and conveniently with our notions about both grammars
and automata. The first and fourth paragraphs below are taken almost
directly from (LAS 68).

A transduction grammar G_t based on a CF grammar G is a triple
(G, V_T', g), where V_T' is a set of output terminals and g is a mapping
defined on G which associates a string ω' in (V_T' ∪ V_N)* with each
production A → ω in G and which specifies a one-to-one correspondence
that pairs each instance of a nonterminal in ω with an instance of the same
nonterminal in ω'.† We refer to the string ω' as the transduction element
for production A → ω.

We are interested, for the present at least, only in simple suffix
transduction grammars (SSTGs),†† since they are trivially adaptable to our
results thus far. "Simple" means the corresponding nonterminals are
in the same order in ω and ω'. "Suffix" implies the additional stipulation
that the nonterminals in ω' must all be to the left of any output terminals.

† A similar definition in which "translation rules" were associated with the
alternatives of Backus Naur Form definitions appeared in (Eva 65).

†† We use "suffix" where "Polish" was used in (LAS 68) because it is more
specific. Also, for those readers who like to reference "semantic routines"
via output symbols in the middle of the right parts of productions, it is shown
in (LAS 68) that for many simple transduction grammars based on LR(k)
grammars there are "structurally equivalent" SSTGs which define the same
translation and which are based on LR(k') grammars for some finite k' > k.

An example SSTG G_t1 based on our example grammar G_1 is
as follows, where the transduction element for each production is in
brackets to the right of the production.

(0) S → ⊢ E ⊣    {E}
(1) E → E + T    {ET+}
(2) E → T        {T}
(3) T → P ↑ T    {PT↑}
(4) T → P        {P}
(5) P → i        {i}
(6) P → ( E )    {E}

The transduction elements may be thought of as defining an
output grammar G', where production A → ω' is in G' if and only if ω'
is the transduction element for production A → ω in G. Each derivation
from S using G has a corresponding derivation using G' which is obtained
by applying corresponding productions to corresponding nonterminals.
Thus, for each derivation of a sentence η in L(G) there is a corresponding
derivation leading to a string η' in (V_T')*. The string η' is
called a translation of η induced by G_t.

Our example SSTG G_t1 above induces translations of strings in
L(G_1) which are commonly called "suffix forms" (Che 67). For example,
the translation of η_1 = ⊢ i ↑ i + i ⊣ induced by G_t1 is η_1' = i i ↑ i +.
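
The notion of an induced translation can be made concrete with a short Python sketch, offered purely as an illustration. It walks a derivation tree and applies the corresponding production of the output grammar at each node. The representation of a derivation tree as a pair (production number, subtrees), the names ELEMENT and induced_translation, and the use of ^ for the exponentiation operator are all assumptions made here; the element of production (2) is taken to be T so that the nonterminals of the production and of its element correspond.

```python
NONTERMINALS = {"S", "E", "T", "P"}

# Transduction elements of the example SSTG, indexed by production number.
ELEMENT = {
    0: ["E"],            # S -> |- E -|
    1: ["E", "T", "+"],  # E -> E + T
    2: ["T"],            # E -> T
    3: ["P", "T", "^"],  # T -> P ^ T
    4: ["P"],            # T -> P
    5: ["i"],            # P -> i
    6: ["E"],            # P -> ( E )
}


def induced_translation(tree):
    """tree = (production number, subtrees), one subtree per nonterminal of the right part."""
    production, subtrees = tree
    pending = iter(subtrees)
    out = []
    for symbol in ELEMENT[production]:
        if symbol in NONTERMINALS:
            # "Simple": the k-th nonterminal of the element pairs with the k-th subtree.
            out.extend(induced_translation(next(pending)))
        else:
            out.append(symbol)  # an output terminal
    return out


# Derivation tree of the sentence |- i ^ i + i -| (hypothetical encoding).
P_I = (5, [])
TREE = (0, [(1, [(2, [(3, [P_I, (4, [P_I])])]), (4, [P_I])])])
```

For TREE the function returns the symbols i, i, ^, i, +, which is the suffix form of the sentence.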

6.6 Translators

We now show that the translations induced by an SSTG G_t = (G, V_T', g)
are in one-to-one correspondence to the parses of the sentences in L(G).
Consider a canonical derivation S = α_0 → α_1 → ... → α_n = η of a sentence η
using grammar G. If step i (1 ≤ i ≤ n) is the application of production p_i,
whose transduction element is ω'_{p_i} = γ_{p_i} δ_{p_i}, where γ_{p_i} is
in V_N* and δ_{p_i} is in (V_T')*, then the translation of η induced by G_t is
η' = δ_{p_n} δ_{p_{n-1}} ... δ_{p_1}.
Thus, if we were given the reverse of the sequence of productions used in
a canonical derivation of η, i.e., if we were given the canonical parse of
η, we could generate its translation η' directly in a left-to-right manner
by outputting first δ_{p_n}, then δ_{p_{n-1}}, ..., then δ_{p_1}. That is, we can
generate the translation η' of the sentence η simultaneously with the parsing
of η.
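
The following Python sketch illustrates this observation under stated assumptions; it is not the construction developed in this report. It checks that each transduction element has the "simple suffix" shape and then emits, for each production of a canonical parse, the output-terminal suffix of its element. The table SSTG, the function names, the ASCII stand-ins |-, -|, and ^, and the choice of T as the element of production (2) are ours.

```python
NONTERMINALS = {"S", "E", "T", "P"}

# production number -> (right part, transduction element) for the example SSTG
SSTG = {
    0: (["|-", "E", "-|"], ["E"]),
    1: (["E", "+", "T"], ["E", "T", "+"]),
    2: (["T"], ["T"]),
    3: (["P", "^", "T"], ["P", "T", "^"]),
    4: (["P"], ["P"]),
    5: (["i"], ["i"]),
    6: (["(", "E", ")"], ["E"]),
}


def split_element(element):
    """Split an element into gamma (leading nonterminals) and delta (output terminals)."""
    k = 0
    while k < len(element) and element[k] in NONTERMINALS:
        k += 1
    gamma, delta = element[:k], element[k:]
    if any(symbol in NONTERMINALS for symbol in delta):
        raise ValueError("nonterminal to the right of an output terminal")
    return gamma, delta


def is_simple_suffix(right_part, element):
    """True if the nonterminals of the element match those of the right part, in order."""
    gamma, _ = split_element(element)
    return gamma == [s for s in right_part if s in NONTERMINALS]


def translate(canonical_parse):
    """Output delta for each production of the parse, in left-to-right order."""
    out = []
    for p in canonical_parse:
        _, delta = split_element(SSTG[p][1])
        out.extend(delta)
    return out


# Canonical parse of |- i ^ i + i -|: the reverse of its canonical derivation.
PARSE = [5, 5, 4, 3, 2, 5, 4, 1, 0]
```

Productions 0, 2, 4, and 6 contribute nothing to the output, so translate(PARSE) consists only of the symbols output for productions 5, 5, 3, 5, and 1.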

A machine is called a translator† for a transduction grammar
G_t = (G, V_T', g) if and only if (1) it is a recognizer for L(G) and (2) it
maps each string in L(G) into its translation induced by G_t. Clearly, our
DPDA parser for G becomes a translator for an SSTG G_t based on G if,
for each production p with transduction element ω' = γδ, where γ is in
V_N* and δ is in (V_T')*, each "out p" is changed to "out δ".

† This, of course, is our formal, automata-theoretic definition of a
translator. Below we distinguish these from other translators (in the
informal sense) by calling them "context-free syntactical translators
(CFSTs)".

Optimizations. The translator which results from this trivial 
transformation may have points where optimizations are applicable. We 
consider first a "local" optimization and then a "global" one. 

Consider the conversion of the parser of Figure 4.3 (page 85)
to a translator for the SSTG G_t1 above. The transitions from states 6 and
16 become transitions under "pop 0, out ε", or equivalently, under "do nothing".
Thus, the two states and the transitions are unnecessary, and they may
be eliminated as follows. In the case of state 16 the look-ahead transition
from state 7 may be redirected to go directly to state 15. In the case of
state 6 the transition under "top 1" from state 15 may be redirected to go
directly to state 17. But the latter results in a look-back transition to a
state which is itself a look-back state. Clearly, if the "top 1" transition
from 15 to 17 is taken, then the "top 1" transition from 17 to 2 will
also be taken. Thus, we may redirect the "top 1" transition from 15
again, this time to go directly to state 2. The result of applying these
changes to the DPDA of Figure 4.3 is depicted in Figure 6.1.

We do not give an exhaustive list of all possible types of "local"
optimizations which may be applicable after a parser is changed to a
translator. Suffice it to say that (1) all such optimizations arise when a
transition is found to be unnecessary due to its being under "pop 0, out ε",
and (2) whenever a transition is redirected to a new state, an analysis of
the device is in order to detect any redundancies, as in the case above,
in its actions immediately after taking that transition.

Unfortunately, the efficiency of our DPDA as a translator is likely 
to be lower than it was as a parser, notwithstanding the above "local" 
optimizations. The problem is that our DPDA still goes through the 
motions of parsing but does not output anything along with many of its 
actions. This is not immediately obvious from our running example, but 
by analyzing it somewhat and generalizing we can illuminate the problem. 

Consider the actions of the translator of Figure 6. 1 associated 
with states 7 and 15. The decisions which are made there can be described 
in terms of operator precedences and associativities as follows. 

Encoded in state 7 is the information that ↑ is a right associative
operator and it has more binding power than any other operator of the
subexpression which is implicitly stored in the stack when the machine
is in state 7. Thus, when the machine is in state 7, if an ↑ is the next
symbol in the input string, it should be read. The look-ahead set {⊣, +, )}
is just the set of other operators which may be the next symbol and which
have less binding power than ↑. In case the next symbol is one of
these, the device should not read but enter state 15, where it makes
decisions regarding the past rather than the future. For instance, if it
has recently read ... i ↑ i, it makes a reduction and outputs ↑, again because
↑ is more binding than the operators in the look-ahead set. Similarly,
if it has recently read ... i + i, it makes a reduction and outputs +,
because + is left associative and more binding than ⊣ or ).

Now, if in our programming language there are many operators
and many levels of precedence, it will happen that our translator, in
translating a simple string like ⊢ i ⊣, will have to proceed through a
cascade of pairs of states like 7 and 15. In effect, each pair of states
will be associated with a precedence level. The first state will look ahead
to check the precedence of the next operator to be read, and the second
will look back to see if it should make a reduction and output. Of course,
the decisions will be made relative to the precedence level associated with
the pair.

The point of our generalization is that for a simple string like
⊢ i ⊣, many state transitions, look-aheads, and look-backs may have to
be performed before reaching the accepting state, all for an output of
the single symbol i. Of course, the problem can be equally bad with
parenthetical expressions such as ... (i) ... and the inefficiency also
creeps in to a lesser extent with all subexpressions; e.g., once a
subexpression with one operator has been translated, the translator
will have to proceed through the cascade from the level of precedence of
that operator to the top.

To eliminate such inefficiencies we could, as in previous cases,
precompute all possible results and "wire them into the machine". In 
this case this would mean modifying each look-ahead and look-back 
state so that the machine would, in effect, jump as far up any such 
cascade as it should given the next symbol(s) in the input string and the 
top state-name on the stack; i. e. , given the relevant information about 
left and right context. Unfortunately, if we do this for a grammar of 
practical size and usefulness, the state diagram representation of our 
translator is likely to get disturbingly large. We suspect, however, that 
some clever coding tricks can be employed to implement these "jumps 
over cascades" in a reasonable amount of space. We do not pursue 
the subject here, since our objective is not to develop a "fine-tuned" 
implementation technique. Rather, we leave the problem as one for 
future development. 

The reader should notice that in the case of our example grammar
G_1 this "global" optimization amounts to noticing that the string ⊢ i ⊣
can be reduced directly to ⊢ E ⊣ without going through ⊢ P ⊣ and
⊢ T ⊣. However, he should also notice that this depends on the fact that there
are no output terminals in the transduction elements of the two productions
T → P and E → T. Since in general such productions could have output
terminals, i.e., since transduction grammars give us that flexibility,
it is clear that we must wait until the translator is constructed, or at
least until the SSTG is investigated, before attempting to make such an
optimization.
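
The precondition just stated can be checked mechanically by inspecting the SSTG. The Python sketch below is only an illustration of that check, not of the optimization itself, and its names (silent_unit_productions, bypass_targets, the list SSTG) are assumptions. It finds the productions A → B whose transduction element is B alone and computes, for each nonterminal, the nonterminals to which it could be reduced without any output.

```python
NONTERMINALS = {"S", "E", "T", "P"}

# (left part, right part, transduction element); ASCII stand-ins for the end
# markers and the exponentiation operator are assumptions of this sketch.
SSTG = [
    ("S", ["|-", "E", "-|"], ["E"]),
    ("E", ["E", "+", "T"], ["E", "T", "+"]),
    ("E", ["T"], ["T"]),
    ("T", ["P", "^", "T"], ["P", "T", "^"]),
    ("T", ["P"], ["P"]),
    ("P", ["i"], ["i"]),
    ("P", ["(", "E", ")"], ["E"]),
]


def silent_unit_productions(sstg):
    """Pairs (B, A) for productions A -> B whose element is B alone (no output)."""
    return {
        (right[0], left)
        for left, right, element in sstg
        if len(right) == 1 and right[0] in NONTERMINALS and element == right
    }


def bypass_targets(sstg):
    """For each nonterminal B, the nonterminals reachable by silent unit reductions."""
    reach = {}
    for b, a in silent_unit_productions(sstg):
        reach.setdefault(b, set()).add(a)
    changed = True
    while changed:  # transitive closure over the silent reductions
        changed = False
        for targets in reach.values():
            for a in list(targets):
                extra = reach.get(a, set()) - targets
                if extra:
                    targets |= extra
                    changed = True
    return reach
```

For the example SSTG the result records that P may be taken silently to T or E and that T may be taken silently to E. If an output terminal were added to the element of E → T, that production would drop out of the table, which is why the check must wait until the SSTG is known.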

6. 7 A Compiler Model 

Our compiler model is an incomplete one. Indeed, we detail only
the "front end"; i.e., the first three subtranslators and their
interconnections and interactions. The model is similar to Cheatham's (Che 67),
but much of our viewpoint and terminology are inspired by the approach
of Landin (Lan 66) to programming language design. Landin's method
goes roughly as follows.

A programming language is first designed on an abstract level. 
That is, the designer first decides what are to be the primitives of the 
language, what abstract objects are to be in the universe of discourse of 
the language, how things are to be defined in terms of other things, i. e. , 
what sort of definitional facilities are to be available, what sort of 
"structure of expressions" or "linguistic constructs" are to be available 
and how they are to be interconnected for the manipulation of abstract 
objects, etc. At this "abstract syntax" level programs in the language 
are represented by abstract syntax trees. Then the designer provides 
two functions: (a) one to define the mapping or "flattening" of abstract 
syntax trees into a convenient representation for use by programmers, 
i.e., "source code", and (b) the other to define the flattening of the trees
into representations convenient for use by a computer, i.e., "object
code".

Of course, we do not believe that any language has ever been 
designed in a single iteration of the above procedure, but the procedure 
seems to us a good model of the process which designers go through 
repeatedly before finally settling on a particular design. At the least, 
it provides a model of how the language might ideally have been designed 
and it suggests an intuitively reasonable method of formalizing programming 
language specifications (W&E 69). 

In view of the above procedure, then, compiling can be regarded 
as first performing the reverse of mapping (a) above and then performing 
mapping (b). The two tasks correspond exactly to the "front end" and 
the "rear end" of our compiler model, respectively. 

Landin subdivides the first of these mappings into two mappings, 
and we further subdivide one of them into two, so that the "front end" 
of our compiler consists of three subtranslators. We illustrate the 
corresponding mappings with the aid of Figure 6. 2, in which are presented 
four representations of a program in a programming language based on 
grammar G_1. From the viewpoint of compiling, the mappings are as
follows. 

The first is from what Landin calls the "physical" level to what
he calls the "logical" level. The "physical" level is the level at which
the programmer uses the language (Figure 6.2a). The "logical" level is
the level at which certain strings of characters have been recognized
as "textual elements" denoting single entities. The strings might denote
constants, names, operators, key words, or the like (Figure 6.2b). This
mapping is often called "lexical analysis". We call the corresponding
translator a lexical translator. It maps strings of characters, provided
by a programmer via some input device, say, into strings of lexical
tokens. The latter are the terminal symbols of a corresponding CF
grammar, some with certain "semantics" (values, types, etc.) associated
with them.

Figure 6.2. A program at four different levels: (a) the
"physical" level, (b) the "logical" level, (c) the "tree-building
directive" level, and (d) a graphical then a tabular
representation at the "abstract syntax" level.

The second mapping is from the "logical" level to what we call
the "tree-building directive" level. This mapping is performed by our
translator of Section 6.6, which we call here a context-free syntactical
translator (CFST) to distinguish it from other translators (in the informal
sense). The mapping results in a string of tree-building directives, some
of which have "semantics" associated with them as do some of the terminal
symbols (Figure 6.2c).

The third mapping is from the "tree-building directive" level
to the "abstract syntax" level. It is performed by an abstract-syntax
tree builder (ASTB) and it results in an abstract syntax tree having
"semantics" associated with some of its nodes (Figure 6.2d). (For present
purposes ignore the tabular representation of the tree; we discuss it below.)

Our compiler model, with emphasis on the "front end", is
illustrated in Figure 6.3. Note that the "rear end" consists of the
several subtranslators which are in the box labeled EVERYTHING ELSE
and which effect the mapping from abstract syntax tree to "object code".
The box labeled ERROR is intended to be a general error recovery
device; it is called when any other device in the compiler discovers
that the program being compiled is not in the given language.

The boxes labeled LEX and DICTIONARY and the two queues
together form our lexical translator. LEX is basically an FSM which
can be automatically constructed via the technique of Johnson et al.
(Joh 68; see also (LAP 68)) if the method of specifying the "lexicon" of
the language is based on regular expressions. When LEX is activated
it reads from the source code the next string of characters which
represents a single entity, i.e., the next textual element, and it outputs one or
two things: (1) to our CFST, via the "syntactic queue" Q1 which is
necessary for look-ahead, it sends the terminal symbol t which is the
"name" of the element just found, e.g., i for the identifier Abe or 123, and
(2) if the string must have some "semantic" information derived from it,
LEX sends both the "name" t and the string of characters to DICTIONARY.
The latter then derives the appropriate information from the string, stores
the information in the TREE STORAGE TABLE (TST) as a terminal node,
e.g., lines 0, 1, and 2 of the TST of Figure 6.2d, and sends a reference
to the node (TST line number) to the "semantic queue" Q2. Thus, it is
actually DICTIONARY rather than the ASTB which constructs terminal
nodes with associated "semantics".

Within DICTIONARY there is a NAMELIST in which names, e. g. , 
Abe, are stored, and references to the appropriate entries in NAMELIST 
are stored in the TST rather than the names themselves. This is for 
the sake of fast name comparisons and other reasons regarding "attributes" 
of names which are irrelevant for our purposes. 

Our CFST uses Q1 as its input tape, and LEX is activated to
refill Q1 whenever it has insufficient symbols for a read or look-ahead
by the CFST. That is, in effect, when the CFST desires to read or look
ahead it makes the appropriate request of Q1. If Q1 has insufficient
symbols to fill the request, it in turn requests the number it needs from
LEX. As indicated in the figure, LEX deposits symbols into the top of
Q1 and they are removed from the bottom via reads by the CFST. As
noted in Section 2.3 we assume that the program which loads the source
code onto the input tape assures that the last symbol is a ⊣, so that the
compiler will not read past the end of the source code, and therefore,
will stop after some finite time.
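
The demand-driven interaction between the CFST, the syntactic queue, and LEX can be sketched as follows. This Python fragment is a minimal illustration under our own assumptions: LEX is modelled as an iterator of terminal symbols, the class SyntacticQueue and the helper traced_lex are hypothetical names, and |- and -| are ASCII stand-ins for the end markers.

```python
from collections import deque


class SyntacticQueue:
    """Sketch of the syntactic queue: refilled from LEX only when a request cannot be met."""

    def __init__(self, lex):
        self._lex = iter(lex)   # LEX, modelled as an iterator of terminal symbols
        self._items = deque()   # symbols enter at the right (top), leave at the left (bottom)

    def _ensure(self, count):
        while len(self._items) < count:
            symbol = next(self._lex, None)
            if symbol is None:
                raise RuntimeError("request past the end of the source code")
            self._items.append(symbol)

    def look_ahead(self, k=1):
        """Return the next k symbols without removing them."""
        self._ensure(k)
        return list(self._items)[:k]

    def read(self):
        """Remove and return the bottom symbol."""
        self._ensure(1)
        return self._items.popleft()


def traced_lex(symbols, log):
    """A stand-in for LEX which records each symbol as it is produced."""
    for s in symbols:
        log.append(s)
        yield s
```

With log = [] and q = SyntacticQueue(traced_lex(["|-", "i", "+", "i", "-|"], log)), a call q.look_ahead(1) activates LEX exactly once, so that log then holds only the first symbol; LEX runs ahead of the CFST by no more than the look-ahead requested.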

The dashed line in Figure 6.3 indicates that the two queues are
"ganged", in an important sense. We have already seen that, when
LEX processes a textual element with "semantics", both Q1 and Q2
receive a new item. Likewise, as we shall see in the next paragraph,
these pairs are removed from the queues simultaneously. Thus,
although at a given time there may not be as many references in Q2
as there are symbols in Q1 (because some symbols have no "semantics"),
the order of the references in Q2 is the same as the order of their
corresponding symbols in Q1. This correspondence is seen to be
important below.

Let us refer to terminal symbols with associated "semantics"
as "pseudo-terminals". We require that pseudo-terminals be distinguishable
from terminals without "semantics", a not unreasonable restriction for
our purposes. Whenever a pseudo-terminal is read from the bottom of
Q1, the latter sends a signal to the ASTB which causes it to remove the
bottom reference from Q2 and to push that reference on its stack, the
"node-reference stack". It is this stack which the ASTB uses to hold
references to the top nodes of pieces of a partially constructed abstract
syntax tree. Thus, immediately after the CFST reads a pseudo-terminal,
the top reference on the ASTB's stack is to a terminal node which
corresponds to that pseudo-terminal.

Summary. In summary, the lexical translator (LEX plus
DICTIONARY, Q1, and Q2) reads the "source code" and translates it
into a string of symbols, some of which have associated "semantics".
Each time the CFST reads a pseudo-terminal a reference to a corresponding
terminal node with "semantics" is pushed on the ASTB's stack. Each time
the CFST outputs a symbol (i.e., a directive to the ASTB to build a
nonterminal node), the ASTB pops the appropriate number of node-references
off its stack, builds the appropriate nonterminal node whose sons are the
nodes whose references were just popped, and pushes a reference to the
new node on its stack.
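
The bookkeeping summarized above can be sketched in Python as follows. The sketch is illustrative only and rests on several assumptions of ours: the class FrontEnd and its method names are hypothetical, a TST line is modelled as a pair, and the CFST itself is not modelled but is represented by the calls it would cause.

```python
from collections import deque


class FrontEnd:
    """Sketch of DICTIONARY, the semantic queue, and the ASTB acting on the TST."""

    def __init__(self, arity):
        self.arity = arity   # hypothetical: sons consumed by each directive (at least 1)
        self.tst = []        # TREE STORAGE TABLE: one line per node
        self.q2 = deque()    # semantic queue of TST line numbers
        self.stack = []      # node-reference stack of the ASTB

    def dictionary(self, name, text):
        """DICTIONARY: build a terminal node and queue a reference to it."""
        self.tst.append((name, text))
        self.q2.append(len(self.tst) - 1)

    def cfst_reads_pseudo_terminal(self):
        """Move the bottom reference of the semantic queue onto the ASTB's stack."""
        self.stack.append(self.q2.popleft())

    def cfst_outputs(self, directive):
        """ASTB: pop n references, build a nonterminal node, push its reference."""
        n = self.arity[directive]
        if n < 1 or len(self.stack) < n:
            raise ValueError("bad directive " + directive)
        sons = tuple(self.stack[-n:])
        del self.stack[-n:]
        self.tst.append((directive, sons))
        self.stack.append(len(self.tst) - 1)
```

For a source text such as Abe + 123, LEX may run ahead and cause both terminal nodes to be entered in the TST before the CFST has read either pseudo-terminal; the ganging of the queues ensures that the references nevertheless reach the node-reference stack in the right order.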

In a sense, then, the language designer's problem is, in part,
(1) to design a transduction grammar such that the corresponding CFST
issues the appropriate directives at the appropriate times, and (2) to
specify an ASTB which constructs the appropriate trees, given that as
the CFST reads pseudo-terminals the ASTB will be directed to build
corresponding terminal nodes with "semantics". Of course, stated that
way the design problem sounds like a fairly "low level" task. Our next
order of business, then, is to transliterate this task of specifying CFSTs
and ASTBs into one which can be performed at a "high level". This
requires that we return to our approach to language design and work
from there down to the level of tree-building directives.

6.8 Specifying Languages, Translations, Compilers

We have chosen to employ CF grammars as aids to that part of 
language specifications which we describe, after Landin, as specifying 
the mapping of abstract syntax trees into strings of lexical tokens. In 
more common parlance: we use a CF grammar both to define a set of
potentially legal programs, some of which may be screened out by
context-sensitive checks, and to define certain operator precedences, associativities,
etc., by building into the grammar certain "structural properties".
Unfortunately, due to the nature of CF grammars we usually get too much 
"structure" (at least from the viewpoint explained below). We therefore 
propose the use of something very like an SSTG for specifying only the 
amount of "structure" we desire. We elaborate on this subject by first 
considering just what "structural" information is implicit in a parse. 

Consider a variation of our compiler model. Let us assume for 
the moment that every textual element is sent to the DICTIONARY to 
have a corresponding terminal node built from it. If the element has no 
"semantics", then the node will just be a simple terminal node with no 
"semantics" and with the same name as that of the element. Further, 
let us assume that every read by the CFST causes a corresponding 
terminal node reference to be pushed on the ASTB's stack. Finally, 
let us assume that the CFST is replaced by the parser for the grammar
at hand, and that the ASTB is simply a collection of subroutines associated
with the productions such that when the parser outputs production
p, A → ω, a corresponding subroutine is activated which pops |ω|
references from the node-reference stack, builds a nonterminal node named
A with |ω| sons which are the nodes corresponding to the references just
popped, the first popped being the |ω|-th son, and then pushes a reference
to the new node on the node-reference stack.
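
This variation can also be sketched in Python. The fragment below is a hedged illustration, not a description of any existing implementation: the parser is replaced by a hand-written trace of its reads and reductions, nodes are modelled as pairs (name, sons), and the names PRODUCTIONS, build_parse_tree, frontier, and EVENTS are assumptions.

```python
# production number -> (left part, right part); ASCII stand-ins for the end
# markers and the exponentiation operator are assumptions of this sketch.
PRODUCTIONS = {
    0: ("S", ["|-", "E", "-|"]),
    1: ("E", ["E", "+", "T"]),
    2: ("E", ["T"]),
    3: ("T", ["P", "^", "T"]),
    4: ("T", ["P"]),
    5: ("P", ["i"]),
    6: ("P", ["(", "E", ")"]),
}


def build_parse_tree(events):
    """events: ("read", terminal) or ("reduce", production number), in parser order."""
    stack = []  # the node-reference stack; a node is a pair (name, sons)
    for kind, value in events:
        if kind == "read":
            stack.append((value, []))  # every read pushes a terminal node
            continue
        left, right = PRODUCTIONS[value]
        sons = stack[-len(right):]     # the first popped becomes the last son
        del stack[-len(right):]
        if [name for name, _ in sons] != right:
            raise ValueError("stack does not match production %d" % value)
        stack.append((left, sons))
    if len(stack) != 1:
        raise ValueError("the parse is incomplete")
    return stack[0]


def frontier(node):
    """The terminal nodes of a tree, read from left to right."""
    name, sons = node
    if not sons:
        return [name]
    return [leaf for son in sons for leaf in frontier(son)]


# Hypothetical trace of the parser's actions on |- i + i -|.
EVENTS = [
    ("read", "|-"), ("read", "i"), ("reduce", 5), ("reduce", 4), ("reduce", 2),
    ("read", "+"), ("read", "i"), ("reduce", 5), ("reduce", 4), ("reduce", 1),
    ("read", "-|"), ("reduce", 0),
]
```

After the final reduction a single node named S remains, and the frontier of the tree it heads is the original string.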

If this device is applied to some legal program, then after the 
parser has made the final reduction, namely to S, there will be a single 
reference on the ASTB's stack and it will be to the top node of what is 
commonly called a "parse tree". The parse tree contains the same 
information as the parse but the "structural properties" are explicit rather 
than implicit. As an example the parse tree corresponding to our string
⊢ i + i ⊣ generated by G_1 is as follows.

                S
             /  |  \
            ⊢   E   ⊣
             /  |  \
            E   +   T
            |       |
            T       P
            |       |
            P       i
            |
            i




Now, if our language designer has been careful to design into his 
CF grammar all the "structural properties" he desires, then the abstract 
syntax trees can be derived from the parse trees by removing any spurious 
structure which may have crept in, and perhaps also "recoding" the informa- 
tion slightly, e. g. , by renaming nodes. This follows by definition of what 
we mean by the above phrase "design into . . . desires. " That is, we view 
this design problem as one of constructing a grammar which generates
strings having parse trees from which the desired abstract syntax trees 
can be easily derived as just described. 

Thus we must provide the language designer with a way to specify 
what information to keep, what to discard, and how to "recode" any of the 
information in the parse trees. One way he could do this would be on a 
node-by-node basis with respect to parse trees, and therefore, on a 
production-by-production basis with respect to his CF grammar. In 
effect, he could specify replacements for the subroutines which comprise 
the ASTB so nodes would be constructed differently. For instance, for a 
production like E → T he might replace the corresponding subroutine with
one which does nothing, so that a node named E with only one son named
T would never appear in the resulting tree. Similarly, he might change
the subroutine for E → E + T to one which creates a node named + with
two sons.

We place only two restrictions on the designer with respect to his 
new subroutines. The first is really just a matter of the efficiency of 
our compiler. It is inefficient for us to build terminal nodes for textual 
elements with no "semantics" and to carry references to them on the 
node reference stack, because the designer may have no need for them 
in his tree, and even if he does, he can easily build them himself. Thus, 
he should be aware that only references to nodes corresponding to 
pseudo-terminals and nonterminals in the right part of a given production
will be at the top of the stack when the corresponding subroutine is called.
The second restriction is more severe than is necessary, but it is simple
and it still allows adequate power for the purpose at hand. To be sure
that references in the ASTB's stack are always kept in the appropriate
correspondence with pseudo-terminals and nonterminals in the right parts
of productions, we require that any new subroutine have the same effect
relative to the node-reference stack as does the one it replaces; i.e.,
if the original subroutine, or really the original modified to abide by the
first restriction, pops n references and pushes one, then the new
subroutine must pop n references and push one,† unless n = 1, in which case
it may do nothing. Again we have a not unreasonable restriction, given
the application.

A proposal. Now, we hope the reader has not taken the above
discussion too literally. It was intended to illuminate the specification
problem associated with our CFST and ASTB. We do not, however, propose
that the designer should actually think of himself as modifying our compiler,
or necessarily, writing any subroutines, per se. Having gone through
this discussion though, it should be easy to see that the following proposal
will serve as the desired "high level" specification of CFSTs and
ASTBs.

† We might have allowed simply pop n - 1, but pop n - 1 implies that some
information is being discarded. We assume that, if n > 1 references are
popped, then a reference to a new node will be pushed such that the new
node has at least the n corresponding nodes as sons.

We propose that the language designer specify a correspondence 
between strings generated by his CF grammar and abstract syntax 
trees merely by associating tree nodes with his productions. For 
example, for the + operator we might have 

    E → E + T              E
                           |
                           +
                          / \
                         E   T

and for a production with no corresponding node we might have

    E → T                  E
                           |
                           T

Our second restriction above merely implies that for each instance of 
a nonterminal or pseudo-terminal in the production there must be a 
corresponding instance in the corresponding node. Thus, we have a method 
of specification rather similar to a transduction grammar. In fact, if we 
settle on some conventions about diagrams like the above, i. e. , if we 
develop a graphical language for this purpose, a set of node building 



To specify the language BASEL Jorrand (Jor 69) uses the AMBIT/G 
graphical language (Chr 67) to specify the "augments" to productions. His 
approach is an adaptation of Cheatham's and is similar to but more extensive 
than our proposal. 

subroutines and a corresponding SSTG can be derived from a set of such 
"production-node pairs", such that the corresponding CFST and ASTB 
are the appropriate ones for a corresponding compiler. For instance, 
corresponding to the above two examples would be the following components 
of an SSTG: 

     E -> E + T {E T +} 

     E -> T {T} 

and a subroutine called PLUS, say, which would be activated when the 
CFST outputs + to the ASTB. PLUS would pop two references off the 
node-reference stack, use them to build a node named + with two sons, 
and push a reference to that node back on the stack. Of course, we have, 
in effect, made an optimization with regard to the second production: 
rather than have our CFST output a call to a nugatory subroutine, we have 
it output nothing when the reduction T -> E is applied. 
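
The behaviour described for PLUS can be sketched as follows. This is only an illustration under stated assumptions: the names Node, ASTB, leaf, and receive are hypothetical, and leaf merely stands in for whatever pushes a reference when a pseudo-terminal arrives from the lexical translator.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    sons: list = field(default_factory=list)


class ASTB:
    """Toy abstract-syntax-tree builder driven by CFST output symbols."""

    def __init__(self):
        self.stack = []                    # the node-reference stack
        self.routines = {"+": self.plus}   # output symbol -> subroutine

    def leaf(self, name):
        # Hypothetical stand-in for a leaf supplied by the lexical translator.
        self.stack.append(Node(name))

    def plus(self):
        # Pop two references, build a node named "+", push a reference to it.
        right = self.stack.pop()
        left = self.stack.pop()
        self.stack.append(Node("+", [left, right]))

    def receive(self, symbol):
        # Called when the CFST outputs a symbol to the ASTB.
        self.routines[symbol]()


astb = ASTB()
astb.leaf("a")
astb.leaf("b")
astb.receive("+")      # the CFST outputs + after reducing E + T
tree = astb.stack[-1]
```

Note that PLUS pops n = 2 references and pushes exactly one, as the restriction above requires.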

TWSs. Ideally, then, the portion of our TWS which builds the 
"front ends" of compilers would consist primarily of (1) a device which 
translates a specification based on regular expressions into a LEX and a 
DICTIONARY, (2) a compiler which translates a set of production-node 
pairs into a set of node-building subroutines, i.e., the ASTB, and an 
SSTG, and (3) a manifestation of our procedure (summarized in Chapter 
7) for constructing a CFST from an SSTG. 

Of course the latter component is useful only if language designers 
find it possible, natural, and convenient to specify a significant portion 
of the translations of their languages via techniques similar to those we 
have proposed. More specifically, the value of our results depends on 
designers being able, once they have a set of abstract syntax trees in 
mind, to construct an LR(k) grammar which implies parse trees from 
which the abstract syntax trees can be easily derived. Of course, it would 
be even better if the grammar were SLR(1). 

Unfortunately, we know of no significant formal results in this 
area. Currently, designers seem to build operator precedences, etc., 
into grammars purely on the basis of past experience and trial-and-error 
methods. We have pursued the research, then, only because of empirical 
evidence that some related results may be forthcoming. We hope, because 
so many authors (F&G 68) have found LR(k) grammars useful in this way, 
that there are some underlying principles which will some day come to the 
fore. 

Conclusion. We conclude by further illustrating the similarity of 
our model to those of other authors. To do so we consider the absorption 
by the ASTB of some of the tasks conceptually performed by the "rear 
end" of our model. 

As we have already seen, the ASTB can be regarded, even implemented, 
as a collection of subroutines. Consider for example our subroutine PLUS 
above. It could be a much more sophisticated routine than we have 
indicated thus far. For instance, it might check the two sons of the 
node it would build to determine if they are both constants, or if one is 
zero, and if so, perform the addition, i.e., prune the tree; it might 
reorder the sons in some way so that ultimately more efficient "object 
code" would be generated; it might do "type-checking"; ...; it might 
even be able to perform the entire function of the "rear end" with 
respect to the node in question and actually output object code. 
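
As a sketch of the first of these possibilities, the following variant of PLUS prunes the tree when both sons are constants or when one son is zero. The tuple representation of nodes, ("const", value), ("name", id), and ("+", left, right), is an assumption made for the illustration.

```python
def plus(stack):
    """A more ambitious PLUS: fold constants and drop additions of zero."""
    right = stack.pop()
    left = stack.pop()
    if left[0] == "const" and right[0] == "const":
        node = ("const", left[1] + right[1])   # both sons constant: add now
    elif left == ("const", 0):
        node = right                           # 0 + x prunes to x
    elif right == ("const", 0):
        node = left                            # x + 0 prunes to x
    else:
        node = ("+", left, right)              # ordinary node with two sons
    stack.append(node)


stack = [("const", 2), ("const", 3)]
plus(stack)
folded = stack[-1]

stack = [("name", "x"), ("const", 0)]
plus(stack)
pruned = stack[-1]

stack = [("name", "x"), ("name", "y")]
plus(stack)
built = stack[-1]
```

In every case the routine still pops two references and pushes one, so the CFST is unaffected by how much work the subroutine chooses to do.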

It should be clear, then, how similar our approach is, basically, 
to the approaches of Cheatham and Feldman 

Chapter 7 
IMPLEMENTATION ISSUES 

We seek in this chapter to illustrate the practicability of our 
scheme. To do so we choose a particular method of implementing 
our translators and present the results when the method is applied to 
a particular, practical transduction grammar. Our implementation 
should be regarded only as a first approximation to an optimal one. 
We have not labored at getting an optimal solution, but only at getting 
one which would illustrate the potential of our methods. Undoubtedly, 
some empirical results would be invaluable aids in "tuning up" our 
implementation. 

Before presenting our practical example we discuss further the 
construction of CFSMs and then we summarize our translator constructing 
technique as a whole. 

7. 1 Constructing CFSMs 

The CFSM of a CF grammar G can be constructed from the productions 
of G in a manner similar to the well known technique for constructing an 
FSM from the productions of a right linear grammar. (See for example 
(D&D 69) for a thorough discussion of the latter technique.) We review the 
technique here because our technique is derived from it. 

The productions of a right linear grammar G_R are either of the 
form A -> a1 a2 ... an or A -> a1 a2 ... am B, where n and m are ≥ 0, the ai 
are terminals, and A and B are nonterminals. We construct an FSM 
which recognizes the strings generated by G_R by forming a small piece 
of the machine for each production and then putting the pieces together. 
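For the production A -> a1 a2 ... an the corresponding piece is 

[Diagram: a path of transitions under a1, a2, ..., an leading from the state named A to the terminal state.] 

that is, a path which spells out a1 a2 ... an and leads from a state named 
A, the left part of the production, to the terminal state. For a production 
of the form A -> a1 a2 ... am B the corresponding piece is 

[Diagram: a path of transitions under a1, a2, ..., am leading from the state named A to the state named B.] 
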
that is, a path which spells out the string of terminals in the right part 
and leads from a state named A to a state named B. If we simply put all 
the pieces together by identifying all states with the same name as the 
same state, we get the desired FSM, although it may be nondeterministic. 
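
The piece-by-piece construction just reviewed can be sketched as follows. The representation is an assumption made for illustration: a production is a triple (A, terminals, B) with B = None for the form without a trailing nonterminal, intermediate states are hypothetical tuples ("mid", n), and a production with no terminals is treated as an empty move. The resulting machine may be nondeterministic, so it is run by tracking a set of current states.

```python
from itertools import count

FINAL = ("final",)   # the terminal state


def build_fsm(productions):
    """Map (state, symbol) -> set of next states; symbol None is an empty move."""
    fresh = count()
    delta = {}
    for left, terminals, right in productions:
        target = right if right is not None else FINAL
        if not terminals:                      # n = 0 or m = 0
            delta.setdefault((left, None), set()).add(target)
            continue
        state = left                           # states named alike are identified
        for i, sym in enumerate(terminals):
            last = i == len(terminals) - 1
            nxt = target if last else ("mid", next(fresh))
            delta.setdefault((state, sym), set()).add(nxt)
            state = nxt
    return delta


def closure(states, delta):
    seen, todo = set(states), list(states)
    while todo:
        for nxt in delta.get((todo.pop(), None), ()):
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen


def accepts(delta, start, string):
    current = closure({start}, delta)
    for sym in string:
        moved = set()
        for state in current:
            moved |= delta.get((state, sym), set())
        current = closure(moved, delta)
    return FINAL in current


# S -> a b S | c
fsm = build_fsm([("S", ["a", "b"], "S"), ("S", ["c"], None)])
```

For this grammar the machine accepts c, abc, ababc, and so on, and rejects strings such as ab.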

Now, to build our CFSM we could just apply the above procedure 
to G's characteristic grammar. However, since that grammar is so 
closely related to G, we can transliterate the procedure to one which will 
work directly on G. We illustrate the procedure using our example grammar. 
Consider the production (1) E -> E + T. There are three corresponding 
productions in the example's characteristic grammar, E' -> E + T #1, 
E' -> E + T', and E' -> E'. The corresponding FSM pieces are as follows. 

[Diagrams: the three FSM pieces, one for each of the productions E' -> E + T #1, E' -> E + T', and E' -> E'.] 

In the latter case we visualize the production written E' -> ε E' so it fits 
the second rule above. If we now combine all the pieces corresponding 
to the single production of G, just as we would do if they were all of 
the pieces, and change the result to a deterministic (piece of) FSM, we 
get the following. 

[Diagram: the combined piece for production (1), a path from the state named E' which spells out E + T #1 and leads to the terminal state, with transitions to the states named E' and T'.] 

It is easy to see that, in general, the piece corresponding to a production 
(p) A -> ω consists of a path which spells out ω #p and leads from a state 
named A' to the terminal state, such that from each state in the path 
having a transition under a nonterminal B there is also an ε-transition 
to a state named B'. If all the pieces corresponding to the productions 
of G are put together by identifying all states with the same name as the 
same state, an FSM with ε-transitions results which recognizes the set of 
characteristic strings. The ε-transitions can be removed by well-known 
techniques (again see (D&D 69)) and the machine can be made deterministic 
and reduced. The result is the desired CFSM. 
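
The general rule can be sketched in code. The sketch below builds one piece per production, identifies states with the same name, and then removes the ε-transitions and makes the machine deterministic by the usual subset construction; it does not perform the final reduction step. The small expression grammar, the ASCII spellings |-, -|, and ^, and all function names are assumptions made for the illustration.

```python
from collections import deque


def pieces(productions):
    """One path per production; a symbol of None marks an empty transition."""
    nonterminals = {left for left, _ in productions}
    nfa = {}

    def add(src, sym, dst):
        nfa.setdefault(src, []).append((sym, dst))

    for p, (left, right) in enumerate(productions):
        state = ("name", left)                 # the state named A'
        for i, sym in enumerate(right):
            if sym in nonterminals:
                add(state, None, ("name", sym))   # empty move to the state B'
            nxt = ("path", p, i + 1)
            add(state, sym, nxt)
            state = nxt
        add(state, ("#", p), "FINAL")          # the #p-transition
    return nfa


def closure(states, nfa):
    seen, todo = set(states), list(states)
    while todo:
        for sym, dst in nfa.get(todo.pop(), []):
            if sym is None and dst not in seen:
                seen.add(dst)
                todo.append(dst)
    return frozenset(seen)


def cfsm(productions, start):
    nfa = pieces(productions)
    initial = closure({("name", start)}, nfa)
    table, queue = {}, deque([initial])
    while queue:
        current = queue.popleft()
        if current in table:
            continue
        moves = {}
        for state in current:
            for sym, dst in nfa.get(state, []):
                if sym is not None:
                    moves.setdefault(sym, set()).add(dst)
        table[current] = {s: closure(d, nfa) for s, d in moves.items()}
        queue.extend(table[current].values())
    return initial, table


def inadequate_states(table):
    # A state with a #p-transition and at least one other transition.
    return [state for state, moves in table.items()
            if len(moves) > 1 and any(isinstance(s, tuple) for s in moves)]


GRAMMAR = [
    ("S", ["|-", "E", "-|"]),
    ("E", ["E", "+", "T"]),
    ("E", ["T"]),
    ("T", ["P", "^", "T"]),
    ("T", ["P"]),
    ("P", ["i"]),
    ("P", ["(", "E", ")"]),
]
initial, table = cfsm(GRAMMAR, "S")
inadequate = inadequate_states(table)
```

For this grammar the deterministic machine has fifteen states, counting the terminal state, and exactly one of them is inadequate: the state reached after P, which has both a ^-transition and a #-transition.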

7. 2 An Efficient Translator Constructing Procedure 

We now review our procedure for the construction of a translator 
from an SSTG G based on a CF grammar G. The review is rather terse, 
being presented as an imperative "English program" with simple, forward 
jumps. Our purpose is to summarize the procedure as a whole and to point 
out the general order in which things might be done in a TWS. The order 
suggested here is largely a result of our experience with our single example 
presented below and should therefore be to some extent "taken with a grain 
of salt". Also, since the most useful TWS is undoubtedly an interactive 
one, some of the decisions built-in below should probably be made variable. 
Certainly, more empirical results are necessary for the development of 
an optimum strategy. 

The translator constructing procedure is as follows. Note that 
we have referenced pertinent definitions, theorems, sections, and page 
numbers. 



START: 
    Generate G's CFSM (Section 7.1). In the process do 
    the following for future purposes: (1) for each 
    nonterminal A record in a "nonterminal-transition 
    table" all pairs of states such that there is an 
    A-transition from the first to the second, (2) note 
    whether there are any inadequate states and if so 
    which, and (3) associate with each production p a 
    "set of p-states", those states which have #p-transitions. 

LR(0): 
    If the CFSM has no inadequate states then G is LR(0) 
    (Theorem 3.4), so go to COMPUTE LOOK-BACK (below). 

SLR(1): 
    For each inadequate state N compute the simple 
    1-look-ahead sets (Definition 4.2) for the transitions 
    from N. If these sets are mutually disjoint for each 
    such state, then G is SLR(1) (Definition 4.3) so convert 
    the CFSM to the SLR1FSM (Definition 4.4) and go to 
    COMPUTE LOOK-BACK. 

SLR(k): 
    For each inadequate state N with overlapping simple 
    1-look-ahead sets compute the simple k-look-ahead 
    sets (Definition 4.2) for the transitions from N for the 
    largest value of k for which we are willing to implement 
    a translator with k-symbol look-ahead. (This value of 
    k is probably dependent upon the number of such states, 
    the implementation, and perhaps the language designer, 
    if the TWS is interactive. Empirical results are needed 
    here.) 
    If these sets are mutually disjoint, G is SLR(k) 
    (Definition 4.3) so minimize look-ahead (Section 4.4), 
    convert the CFSM to the SLRkFSM (Definition 4.4) and 
    go to COMPUTE LOOK-BACK. 

¬SLR(k): 
    Report to the language designer that his grammar is not 
    SLR(k) for an acceptable k. Provide him with some 
    information regarding what kinds of strings need more 
    than k-symbol look-ahead and/or state-splitting to determine 
    their characteristic strings. (Empirical results are needed 
    regarding what information is useful to the designer.) 
    Then, if the designer so desires, continue with the more 
    complex techniques which follow. 

LmRk: 
    For each inadequate state N which has overlapping simple 
    k-look-ahead sets, choose the above value of k and a 
    similarly maximal value of m and compute the sets of 
    (m, k)-bounded-context pairs (Definition 5.3) for the 
    transitions from N. 
    If these sets are mutually disjoint, G is L(m)R(k) 
    (Definition 5.4) so convert the CFSM to the LmRkFSM 
    (Definition 5.5) with minimum look-ahead (Section 4.4), 
    change the "nonterminal-transition table" and the "sets 
    of p-states" (see START above) appropriately so that 
    they reflect the new states and transitions, and go to 
    COMPUTE LOOK-BACK. 

LR(k): 
    For each inadequate state N with overlapping context 
    pairs and for k as above, compute the strings φ (page 123) 
    which access N via paths with no more than k instances 
    of a given cycle, then compute the look-ahead sets 
    corresponding to each such φ (page 124) and each 
    transition from N. 
    If these look-ahead sets are mutually disjoint for each 
    such N, G is LR(k) (page 128) so convert the CFSM to 
    the LRkFSM (page 125) with minimal look-ahead 
    (Section 4.4), change the "nonterminal-transition table" 
    and the "sets of p-states" appropriately, and go to 
    COMPUTE LOOK-BACK. 

¬LR(k): 
    Otherwise, G is not LR(k) for an acceptable k so reject 
    G and provide the language designer with some information 
    regarding what kinds of strings need more than k symbols of 
    look-ahead to determine their characteristic strings. 

COMPUTE LOOK-BACK: 
    Associate with each #p-transition in the FSM a "look-back 
    set" of state pairs (the set Q on page 54), for the 
    computation of look-back transitions below. For a 
    #p-transition from state R, where production p is A -> ω, 
    the set is as follows. If there is but one #p-transition 
    in the machine, the set is the set of pairs associated 
    with A in the "nonterminal-transition table". Otherwise, 
    the set is the subset Q of A's set such that for each 
    pair (N, M) in Q there is a path from N to R which spells 
    out ω. 

XLATOR: 
    If production p has transduction element ω' such that 
    ω' = γδ and γ ∈ V* and δ ∈ (V')*, replace the #p-transition 
    with one under "pop |ω|, output δ" (page 142) to a new 
    state R'. 

ADD LOOK-BACK: 
    R' (page 54) has a transition under "top N" to state M for 
    each pair (N, M) in the "look-back set" associated above 
    with the #p-transition. Eliminate equivalent look-back 
    states (page 60). 

ADD LOOK-AHEAD: 
    Convert each inadequate state (if any) to a look-ahead state 
    (Figure 4.2). 

OPT: 
    Optimize the DPDA by (a) deleting transitions under 
    nonterminals (page 56) and "pop 0, out ε" (page 142), 
    (b) eliminating redundancies via precomputation 
    (page 144), by minimizing look-back, pushing, 
    and popping (page 61), and (c) precomputing jumps 
    over cascades of look-ahead and look-back states 
    (page 146). 

END: 
    All done. 

We emphasize that we expect most of the grammars of interest 
to be SLR(1), the remainder to be SLR(k) for k = 2 or 3 (caused by only 
one or two inadequate states, at that), and none to require the more 
complex L(m)R(k) or general LR(k) techniques. Thus, the poor state of 
our strategy regarding those complex techniques is not likely to be a problem, 
at least with respect to programming languages. However, if our TWS is 
to be employed in some other application where more complex grammars are 
to be expected, that strategy will require development. Otherwise, a 
considerable amount of computation time is likely to be expended in deciding 
whether a grammar is, indeed, LR(k) for an acceptable k. 
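
The SLR(1) test in the procedure above amounts to checking that the look-ahead sets of an inadequate state are mutually disjoint. The following sketch assumes that the simple 1-look-ahead set of a read transition is the symbol read and that the set for a #p-transition, where p is A -> ω, is the set of terminals which can follow A. It also assumes a grammar with no empty right parts. The grammar and all names are chosen for illustration only.

```python
def first_sets(productions, nonterminals):
    first = {a: set() for a in nonterminals}
    changed = True
    while changed:
        changed = False
        for left, right in productions:
            head = right[0]
            new = first[head] if head in nonterminals else {head}
            if not new <= first[left]:
                first[left] |= new
                changed = True
    return first


def follow_sets(productions):
    nonterminals = {left for left, _ in productions}
    first = first_sets(productions, nonterminals)
    follow = {a: set() for a in nonterminals}
    changed = True
    while changed:
        changed = False
        for left, right in productions:
            for i, sym in enumerate(right):
                if sym not in nonterminals:
                    continue
                if i + 1 < len(right):
                    nxt = right[i + 1]
                    new = first[nxt] if nxt in nonterminals else {nxt}
                else:
                    new = follow[left]     # sym ends the right part
                if not new <= follow[sym]:
                    follow[sym] |= new
                    changed = True
    return follow


def mutually_disjoint(sets):
    seen = set()
    for s in sets:
        if seen & s:
            return False
        seen |= s
    return True


GRAMMAR = [
    ("S", ["|-", "E", "-|"]),
    ("E", ["E", "+", "T"]),
    ("E", ["T"]),
    ("T", ["P", "^", "T"]),
    ("T", ["P"]),
    ("P", ["i"]),
    ("P", ["(", "E", ")"]),
]
follow = follow_sets(GRAMMAR)

# The inadequate state reached after P can read ^ or reduce by T -> P.
adequate_with_look_ahead = mutually_disjoint([{"^"}, follow["T"]])
```

Here the two sets are {^} and {-|, +, )}, which are disjoint, so one symbol of look-ahead resolves the state.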

7. 3 Tabular Translators, an Interpreter 

In this section we present a method of representing our translators 
by means of tables, and we present via a flowchart an interpreter for those 
tables. We first illustrate our storage method by using our trivial SSTG G; 
then we present the interpreter. (However, the reader may find it helpful 
to refer to the interpreter (Figure 7.2 below) as he follows the description 
of the storage method.) This implementation works only for LR(1) grammars 
whose CFSMs have no multiply inadequate states. Nonetheless, it covers 
our practical example which is presented in Section 7.4. We discuss in 
Section 7.6 the modifications necessary to cover the general case. 

Shown in Figure 7.1 is a tabular representation of the translator 
of Figure 6.1 (page 143). Note that we have stored the information regarding 
states, transitions, and look-ahead sets in a STATE TABLE (ST), a 
TRANSITION TABLE (TT), and a LOOK-AHEAD TABLE (LAT), respectively. 
Each entry in the ST corresponds to a state and it has three components. 
The first, TYPE, indicates the type of state and it can have one of the seven 
values: READ, LA (look-ahead), POP (pop and output), LB (look-back), 
EXIT (the terminal state), *READ, and *LA, the last two of which indicate 
states which push (*) their names (ST line numbers) on the stack. This 
covers all types of states which can appear in our translators. In the 
case of a POP state, the second component, NUM, is the number of state 
names to pop from the stack. However, in all other cases NUM is the 
number of transitions from the state. The transitions are represented by 
contiguous entries in the TT and the third component, TTREF, is a reference 
to the topmost of these entries; i.e., it is a TT line number. 

STATE TABLE (ST) 

       TYPE    NUM  TTREF 
       READ      1 
   1   *READ     2      1 
   2   READ      2      3 
   3   POP       1      6 
   4   *READ     2      1 
   5   POP       1      7 
   6 
   7   LA        2      8 
   8   *READ     2      1 
   9   POP       1     10 
  10   POP             11 
  11   *READ     2      1 
  12   READ      2      4 
  13   POP       1     12 
  14   EXIT 
  15   LB        4     13 
  16 
  17   LB        2     13 

TRANSITION TABLE (TT), with columns SYM and STATE, and LOOK-AHEAD TABLE (LAT): [entries not legible] 

Figure 7.1. The DPDA-translator for the example SSTG G 
represented by tables, i.e., a tabular version of 
Figure 6.1. 

Each entry in the TT consists of two components, SYM and STATE. 
In the case of the entries for a READ or *READ state and all but the last 
entry for an LA or *LA state, each entry represents a read transition 
under SYM to STATE. The last entry for an LA or *LA state represents 
a look-ahead transition to STATE under the look-ahead set implied by 
the SYM-th row of the LAT. If there is a "1" in column t of row n of 
the LAT, where t is a terminal symbol, then t is in the look-ahead set 
implied by row n. In the case of an LB state the corresponding TT entries 
represent look-back transitions; i.e., each means: if the top state-name 
on the stack is the same as SYM, go to STATE. 

For POP states there is always only one transition and it is under 
"pop NUM, output SYM" to STATE. TYPE is the only relevant component 
for the EXIT state. 

The following examples illustrate the meanings. 

(1) From line one of the ST we see that state 1 is a push-then-read 
state, i.e., it is represented by a square in the corresponding 
state diagram, and it has two transitions which are listed contiguously 
in the TT starting at line one. From the TT we see that state 1 has 
an i-transition to state 10 and a (-transition to state 11. 

(2) From line nine of the ST we see that state 9 is a pop-then-output 
state which, since the NUM component is 1 and the TTREF points 
to the pair (↑, 15), has a transition under "pop 1, output ↑" 
to state 15. Note that some of the POP states should output 
nothing, as indicated by ε in the SYM component of their TT entries. 
In such cases our interpreter will actually output something, namely 
ε; therefore our ASTB will have to have a nugatory subroutine which 
will be called when this happens. Our practical translator below 
has so few such POP states that we thought it not worth the cost 
of eliminating the inefficiency. 

(3) From line seven of the ST we see that state 7 is a look-ahead state 
with two transitions. One is actually a read transition under ↑ 
to state 8 and the other is a look-ahead transition to state 15. The 
SYM component of line nine of the TT indicates that the look-ahead 
set {⊣, +, )} is implied by line one of the LAT. Note that the LAT 
is included purely for the sake of earlier error detection since, if 
only strings in L(G) were being translated, we could be sure upon 
arriving in state 7 that the next symbol would be ↑, ⊣, +, or ). 
However, since our string may not be in L(G), we take the 
attitude that once a symbol has affected any decision it must be 
validated. 

After two more comments we present the interpreter. First, the 
"holes" in the ST, lines 6 and 16, could obviously have been filled by 
renumbering the states; however, we chose not to do so, so that the one-to-one 
correspondence with Figure 6.1 would be preserved. Second, it should be 
noted that some of the lines of the TT are referenced by more than one 
state. For example, lines one and two are referenced by states 1, 4, 8, 
and 11. This is an important optimization of the use of space in the TT 
which we use extensively in our practical example. We have computed 
the optimization by hand here; however, there exists a graph-theoretic 
method for doing it automatically (I&M 69). 

The interpreter. Since the reader presumably already knows what 
the actions of our DPDA are supposed to be and what the meanings of the 
tables are, we will not elaborate extensively on the operation of the 
interpreter. However, several comments are in order. (1) The interpreter 
is presented via a flowchart in Figure 7.2 and it is described as if it were 
part of our compiler model of Chapter 6. (2) The variable, Stack, denotes 
a large vector which we use as our pushdown stack. The variable, S, is 
used as the stack index. The top name on the stack is always Stack(S - 1). 
(3) We do not have to initialize any input string or pointer to one, since 
that initialization is effected when the lexical translator is initialized, 
before the interpreter is activated. Input and look-ahead symbols are 
acquired from the syntactic queue Q1 as described in Chapter 6. When Q1 
is called with argument LA, the symbol in the queue is returned as the value, 
but the symbol is not removed from the queue. When Q1 is called with 
argument READ it both returns the symbol as its value and removes the 
symbol from the queue. (4) The variables, READ, LA, POP, LB, EXIT, 
*READ, and *LA, may be thought of as denoting some distinct constant 
values. (5) The variables, ST, TT, and LAT, denote two dimensional 
arrays which represent tables such as the ones in Figure 7.1. The 
variable, State, denotes the current state, which is represented by an ST 
line number. The current reference is kept in TTRef. We can view 
TYPE = 1, NUM = 2, TTREF = 3, SYM = 1, and STATE = 2, so that, e.g., 
if State = 10, ST(State, NUM) has the value stored in the tenth row, second 
column of the ST. (6) ASTB is, of course, the abstract-syntax-tree builder 
of Chapter 6. 
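
A table-driven interpreter of the kind described can be sketched as follows. It is not a transcription of the flowchart of Figure 7.2. The tables are hand-built for a small assumed grammar, S -> |- T -|, T -> NAME ^ T with output ^, and T -> NAME with no output; lines are numbered from zero; the LAT rows are sets rather than bit vectors; and a POP state whose SYM is None outputs nothing instead of calling a nugatory subroutine.

```python
# State table: state -> (TYPE, NUM, TTREF); "*" marks states that push their name.
ST = {
    0: ("READ", 1, 0),
    1: ("*READ", 1, 1),
    2: ("LA", 2, 2),
    3: ("READ", 1, 4),
    4: ("*READ", 1, 1),    # shares TT line 1 with state 1
    5: ("POP", 1, 5),
    6: ("LB", 2, 6),
    7: ("EXIT", 0, 0),
}
# Transition table: (SYM, STATE).
TT = [
    ("|-", 1),      # 0: state 0 reads the left end marker
    ("NAME", 2),    # 1: states 1 and 4 read a NAME
    ("^", 4),       # 2: read transition of look-ahead state 2
    (0, 6),         # 3: look-ahead transition under LAT row 0
    ("-|", 7),      # 4: state 3 reads the right end marker
    ("^", 6),       # 5: state 5 pops 1 and outputs ^
    (1, 3),         # 6: look-back, top name 1 -> state 3
    (4, 5),         # 7: look-back, top name 4 -> state 5
]
LAT = [{"-|"}]


def find(entries, key):
    for sym, target in entries:
        if sym == key:
            return target
    return None


def translate(tokens):
    queue, stack, output, state = list(tokens), [], [], 0
    while True:
        kind, num, ref = ST[state]
        if kind == "EXIT":
            return output
        if kind == "POP":                       # pop NUM names, output SYM
            sym, state = TT[ref]
            del stack[len(stack) - num:]
            if sym is not None:
                output.append(sym)
            continue
        if kind == "LB":                        # branch on the top state name
            state = find(TT[ref:ref + num], stack[-1])
            continue
        if kind.startswith("*"):
            stack.append(state)
        symbol = queue[0] if queue else None
        entries = TT[ref:ref + num]
        if kind.endswith("LA"):                 # last entry is the look-ahead one
            (row, ahead), entries = entries[-1], entries[:-1]
        target = find(entries, symbol)
        if target is not None:
            queue.pop(0)                        # a read transition consumes
        elif kind.endswith("LA") and symbol in LAT[row]:
            target = ahead                      # validated but not consumed
        else:
            raise SyntaxError(f"unexpected {symbol!r} in state {state}")
        state = target


result = translate(["|-", "NAME", "^", "NAME", "^", "NAME", "-|"])
```

For the input shown the translator outputs ^ twice, once for each reduction by T -> NAME ^ T; only states 1 and 4 ever push their names, since they are the only states from which a T-transition leaves.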

7. 4 A Practical Example 

The programming language PAL (Pedagogic Algorithmic Language) 
(Eva 68, Eva 69, W&E 69) is used as a vehicle to teach some of the 
fundamentals of programming linguistics to undergraduates interested in 
computer science at the Massachusetts Institute of Technology. It is one 
of the more progressive languages in existence today, being a descendant 
of ISWIM (Lan 66). In a sense PAL is a generalization of ALGOL 60 (Nau 63); 
it has the general functional capabilities of LISP (McC 65), generalized 
structures, and generalized jumps. 

PAL's Grammar. Of course most of this is irrelevant for our 
purposes here. It is the syntax of PAL in which we are interested. Since 
the formal definition of PAL specifies the set of legal programs as a CF 
language, we do not have to remove any "context-sensitive features" from 
the syntax. The syntax is similar to that of ALGOL 60, but it is considerably 
"cleaner" and it is unambiguous. It is specified via modified Backus 
Naur Form (BNF) which, for our purposes, is just a shorthand way of 
writing CF productions. 

As we noted above, the PAL grammar was designed, for the sake 
of pedagogy, to be unambiguous, small, concise, and useful as a 
syntactical reference. Except for the fact it was designed to be unambiguous, 
it can truly be said that the grammar was not designed to be within the 
domain of our parser constructing technique. And yet, the grammar turns 
out to be SLR(1). 

A slightly modified version of the PAL grammar is presented in 
Table 7-1 where nonterminals are denoted by one or two capital letters, 
pseudo-terminals by three or more capitals, and other terminals by strings 
of small letters and/or special characters. The grammar differs from 
real PAL in several respects, which, for our purposes, are minor: (1) 
it includes new constructs which the author has proposed be added to PAL, 
(2) the original uses "regular expressions" in some alternatives to indicate 
nonassociative operators, e.g., DA ::= DR [and DR], and we have changed 
these in an obvious way to get a strict CF grammar which generates the 
same strings, (3) the original grammar has the definitions of CONST and 
RLN built-in, whereas we have moved them into the lexical domain, and 
(4) the operator $ here has different precedence relative to other operators 
than it has in real PAL. 

 (1)  P   ::=  PL | E 
 (3)  PL  ::=  def D PL | def D 
 (5)  E   ::=  let D in E | fn VB . E | EW 
 (8)  EW  ::=  EV where DR | EV 
(10)  EV  ::=  valof C | C 
(12)  C   ::=  CL ; C | CL 
(14)  CL  ::=  NAME : CL | CC 
(16)  CC  ::=  test B ifso CL ifnot CL | test B ifnot CL ifso CL 
             | if B do CL | unless B do CL | while B do CL 
             | until B do CL | CB 
(23)  CB  ::=  T := T | goto R | res T | T 
(27)  T   ::=  TA , T | TA 
(29)  TA  ::=  TA aug TC | TC 
(31)  TC  ::=  B -> TC bar TC | TE 
(33)  TE  ::=  $ R | B 
(35)  B   ::=  B or BT | BT 
(37)  BT  ::=  BT & BS | BS 
(39)  BS  ::=  not BP | BP 
(41)  BP  ::=  A RLN A | A 
(43)  A   ::=  A + AT | A - AT | + AT | - AT | AT 
(48)  AT  ::=  AT * AF | AT / AF | AF 
(51)  AF  ::=  AP ** AF | AP 
(53)  AP  ::=  AP % NAME R | R 
(55)  R   ::=  R RN | RN 
(57)  RN  ::=  NAME | CONST | ( E ) | [ E ] 
(61)  D   ::=  DI within D | DI 
(63)  DI  ::=  DI inwhich DA | DA 
(65)  DA  ::=  DR and DA | DR 
(67)  DR  ::=  rec DB | DB 
(69)  DB  ::=  VL = E | NAME V = E | ( D ) | [ D ] 
(73)  V   ::=  VB V | VB 
(75)  VB  ::=  NAME | ( VL ) | ( ) 
(78)  VL  ::=  NAME , VL | NAME 

comment: "bar" really should be "|" but, if it were, the BNF would 
read incorrectly. 

Table 7-1. The PAL grammar. It has 48 terminals, 3 of 
which are pseudo-terminals (NAME, CONST, and RLN), 32 
nonterminals, and 80 productions. The grammar is SLR(1). 

Some statistics pertinent to the PAL grammar are as follows. It 
has 48 terminals, 3 of which are pseudo-terminals, 32 nonterminals, 
and 80 productions. The corresponding CFSM has 157 states, 26 of 
which are inadequate, but none of which are multiply inadequate, and 61 
of which must push their names on the stack during parsing. 

Since our interest here is primarily in PAL's CFST, we concentrate 
on its transduction grammar rather than the production-node pairs. The 
SSTG is implied by PAL's output grammar which is presented in Table 7-2. 
The output grammar is our own concoction; heretofore, the correspondence 
between PAL programs and abstract syntax trees has been specified 
informally by PAL designers. 

When the SSTG is viewed as a specification of the CFST, the 
pseudo-terminals should be regarded as nonterminals; however, if the 
outputs from the CFST and the lexical translator together, as seen by the 
ASTB, are being specified, the pseudo-terminals should be regarded as 
terminals. (Recall the restriction discussed on page 160 and the summary 
of the interactions of the components of our compiler model on page 154.) 

In most cases the abstract-syntax-tree node corresponding to a 
given production can be determined from its transduction element ω' as 
follows: if ω' consists only of a nonterminal, there is no node, or if only 
a pseudo-terminal, then a terminal node with "semantics", or if ω' = γδ 
where γ is a string of nonterminals and pseudo-terminals and δ is a 
terminal, the node has name δ and |ω'| - 1 sons which are the nodes 
corresponding to the symbols in γ and in the same order. Exceptional 
cases are trivially different and of no importance here, since they concern 
the node-building subroutines of PAL's ASTB, but not its CFST. 

      P   ::=  ... 
      PL  ::=  ... | D lastdef 
      E   ::=  D E let | VB E λ | EW 
      EW  ::=  EV DR where | EV 
      EV  ::=  C valof | C 
      C   ::=  CL C ; | CL 
      CL  ::=  NAME CL : | CC 
      CC  ::=  B CL CL test-t | B CL CL test-f | B CL if 
             | B CL unless | B CL while | B CL until | CB 
      CB  ::=  T T := | R goto | T res | T 
      T   ::=  TA T , | TA 
      TA  ::=  TA TC aug | TC 
      TC  ::=  B TC TC test-t | TE 
      TE  ::=  R $ | B 
      B   ::=  B BT or | BT 
      BT  ::=  BT BS & | BS 
      BS  ::=  BP not | BP 
      BP  ::=  A RLN A rln | A 
      A   ::=  A AT + | A AT - | AT pos | AT neg | AT 
      AT  ::=  AT AF * | AT AF / | AF 
      AF  ::=  AP AF ** | AP 
      AP  ::=  AP NAME R % | R 
      R   ::=  R RN γ | RN 
      RN  ::=  NAME | CONST | E | E 
      D   ::=  DI D within | DI 
      DI  ::=  DI DA inwhich | DA 
      DA  ::=  DR DA and | DR 
      DR  ::=  DB rec | DB 
      DB  ::=  VL E = | NAME V E ff | ... 
      V   ::=  VB V bv | VB 
      VB  ::=  NAME | VL | () 
      VL  ::=  NAME VL vl | NAME 

Table 7-2. The output grammar for PAL. 

PAL's CFST is presented in Figure 7.3 where the ST spans the 
first two pages, the TT spans the first three, and the LAT is shown in the 
fourth page. In the latter case only the numbered lines are to be considered 
in the LAT; we have included the extra lines to indicate for each nonterminal 
A the set F(A) for the thoroughly interested reader. 

Space-efficiency. For lack of a better choice, we define the 
space-efficiency of a translator T corresponding to an SSTG G to be the ratio of 
the space necessary for storing G to that for storing T. 

Let us compute a rough estimate of the space-efficiency of PAL's 
CFST. The ST contains 172 entries. There are seven possible values for 
the TYPE component, requiring three bits, values as high as 18 in the 
NUM component, requiring five bits, and values as high as 254 in the TTREF 
component, requiring eight bits. Thus, the ST requires 172*(3+5+8) = 2752 
bits. The TT contains 255 entries. The largest values in the SYM and 
STATE components are 154 and 171, respectively, requiring eight bits each. 
Thus, the TT requires 255*(8+8) = 4080 bits. The LAT requires 16 rows, 
each with 48 binary entries, or 768 bits. In total the translator requires 
7600 bits, or 238 words at 32 bits per word, of memory space. 

Now there are 80 distinct symbols necessary to store the SSTG, 
thus requiring seven bits each. If for each production p, A -> ω, we 
assume we need store only |ω|+2 symbols, one for the left part, 
|ω| for the right part, and one for an output terminal, then it takes 342*7 = 
2394 bits = 75 words to store PAL's SSTG. Thus the space-efficiency of 
the PAL translator is a respectable 75/238 = 31%. It seems clear that 
this figure could be increased somewhat by bringing to bear some coding 
tricks; however, it is not our purpose here to develop an optimal 
implementation as regards either space or time. In fact, our scheme is already 
competitive with existing schemes, as we show next by comparing it with 
one which is well known to be fast and efficient (F&G 68). 
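
The arithmetic of this estimate can be checked mechanically. The sketch below uses only the figures quoted above; the helper bits_for and the variable names are introduced for the illustration, and the seven TYPE values are assumed to be coded 0 through 6.

```python
from math import ceil

WORD = 32  # bits per word


def bits_for(max_value):
    """Bits needed to hold any value from 0 through max_value."""
    return max(1, max_value.bit_length())


st_bits = 172 * (bits_for(6) + bits_for(18) + bits_for(254))   # TYPE, NUM, TTREF
tt_bits = 255 * (bits_for(154) + bits_for(171))                # SYM, STATE
lat_bits = 16 * 48                                             # 16 rows of 48 bits
translator_bits = st_bits + tt_bits + lat_bits
translator_words = ceil(translator_bits / WORD)

sstg_bits = 342 * bits_for(79)          # 342 symbols drawn from 80 distinct ones
sstg_words = ceil(sstg_bits / WORD)

efficiency = sstg_words / translator_words
```

The ratio comes to a little over 0.31, in agreement with the figure of 31% given above.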

7. 5 Comparison with a Precedence Scheme 

For the sake of simplicity we compare parsers rather than 
translators, and we use the PAL grammar when we need pertinent statistics. 
In Figure 7.4 we present a flowchart describing a variation of a "Simple 
Precedence" parser (W&W 66) which is compatible with our terminology 
and compiler model. We do not detail the actions of the parser but only 
note that it makes read-reduce decisions and locates reducible substrings 
by looking up "precedence relations" in a precedence matrix (PM), and it 
determines which production, with left part A and output symbol OutSym, 
is applicable by searching (via Search) the set of productions to find one 
with a right part that matches the reducible substring. 
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Figure 7.**. The interpreter for a "Simple 
Precedence" parser. 
Space. Each entry in the PM can have one of four values, <·, ≐, 
·>, or "none", so two bits are required for each. The rows and columns 
of the PM correspond to the symbols in the grammar. For PAL there 
are 80 symbols so the size of the corresponding PM would be 
80*80*2 = 12,800 bits = 400 words. The size of the production table with 
output symbols would probably be greater than what we calculated above: 
75 words. Thus, the whole parser would require more than 475 words, 
or about twice as much as our translator above. Of course our parser might 
be somewhat larger than the CFST above because of the extra output 
symbols, but probably no more than 15% larger. A more significant 
difference would be that our interpreter would be larger than that for the 
precedence scheme, perhaps by an extra 50 words or so (an "educated 
guess"). On the other hand, the amount of space necessary for the stack 
during execution for our scheme would be less than that for the PM scheme 
(see page 62). In conclusion, then, the two schemes are roughly com- 
parable in space usage. 
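
As a quick check of the precedence-matrix figures, the following sketch repeats the calculation. The variable names are ours; the 80 symbols, the four possible entry values, the 75-word production table, and the 238-word translator come from the text, and the 475 words are a lower bound because the production table is expected to be somewhat larger than 75 words.

```python
import math

N_SYMBOLS = 80      # symbols in PAL's grammar
ENTRY_VALUES = 4    # <., =, .>, or "none"
WORD_BITS = 32

bits_per_entry = math.ceil(math.log2(ENTRY_VALUES))
pm_bits = N_SYMBOLS * N_SYMBOLS * bits_per_entry
pm_words = pm_bits // WORD_BITS

# Lower bound on the whole precedence parser: matrix plus production table.
parser_words = pm_words + 75
ratio_to_translator = parser_words / 238

print(bits_per_entry, pm_bits, pm_words, parser_words, round(ratio_to_translator, 1))
```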

Time . Let us now compare the speeds of the schemes. Following 
this paragraph are four lists of statements which must be executed in the 
performance of reads and reductions by the two schemes. To the left of 
the statements we indicate very rough estimates of the time required to 
execute each statement individually (generally one time unit per statement) 
and each group of statements. The groups comprise statements which 
are executed variable numbers of times depending on the production or 
state involved. We weight these groups according to statistics derived 
from PAL's grammar, CFSM, or translator, as appropriate, and we 
indicate the pertinent statistics to the right of the statements. 

Precedence Matrix (PM) read: 

    1   Symbol ← Ql(READ)              (we count only the store; no lexical analysis) 
    2   Rln ← PM(Stack(S), Symbol) 
    1   Rln = ·> ? 
    1   Rln = ± ? 
    1   Symbol = ⊣ ? 
    1   S ← S + 1 
    1   Stack(S) ← Symbol 

    8 time units total 

CFST read: 

    1       ST(State, TYPE) = (*)READ or (*)LA ? 
          { 1   Stack(S) ← State }   × 61 *READ and *LA states / 
    1.5   { 1   S ← S + 1        }     82 (*)READ and (*)LA states 
    1       Symbol ← Ql(READ) 
          { 1   Last ← TTRef + ST(State, NUM) − 1 } 
    6.8   { 1   Symbol = TT(TTRef, SYM) ?         }   × 1.7 
          { 1   TTRef ← TTRef + 1                 } 
          { 1   TTRef > Last ?                    } 
                (½ (linear search) × avg. no. of read transitions 
                 from (*)LA and (*)READ states) 
    1       State ← TT(TTRef, STATE) 
    1       TTRef ← ST(State, TTRef) 

    12.3 time units total              (about 1.5 times as long as a PM read) 







PM reduce: 

            Symbol ← Ql(READ)                  (happens infrequently) 
    2       Rln ← PM(Stack(S), Symbol) 
    1       Rln = ·> ? 
    1       i ← S − 1 
    1       j ← S 
          { 2   PM(Stack(i), Stack(j)) } 
    9.2   { 1   j ← i                  }   × 2.3 symbols in right part 
          { 1   i ← i − 1              } 
    6.6     A, OutSym ← Search(Prods, Stack(j)...Stack(S)) 
                (1 time unit per store + 2 per each symbol in right part) 
    1       S ← j 
    1       Stack(S) ← A 
    1       ASTB(OutSym) 

    23.8 time units total 



CFST reduce: 

          {     ST(State, TYPE) = (*)LA ? 
          {     Stack(S) ← State }   × 6 *LA states / 
          {     S ← S + 1        }     26 (*)LA states 
    3.6   {     LASymbol ← Ql(LA) 
          {     Last ← TTRef + ST(State, NUM) − 1 
          {     LASymbol = TT(TTRef, SYM) ? }   × (avg. no. of read transitions 
          {     TTRef ← TTRef + 1           }      from (*)LA states) 
          {     TTRef > Last ?              } 
                weighted by 26 (*)LA states / 80 productions 

          { 2   LAT(TT(TTRef, SYM), LASymbol) = 1 ? 
    0.6   { 1   ST(State, TYPE) = POP ?            80 POP states / 80 productions 
            1   S ← S − ST(State, NUM) 
            1   Call ASTB(TT(TTRef, SYM)) 

          {     ST(State, TYPE) = LB ? 
          {     TopState ← Stack(S−1) 
          {     Last ← TTRef + ST(State, NUM) − 1 
          {     TopState = TT(TTRef, SYM) ? }   × (avg. no. of transitions from LB 
          {     TTRef ← TTRef + 1           }      states × ½ (linear search)) 
          {     TTRef > Last ?              } 
                weighted by 22 LB states / 80 productions 

    8.0 time units total               (about 1/3 as long as a PM reduce) 



In conclusion, we see from the above that on the average the PM 
scheme reads symbols about 1.5 times as fast as does ours, but our scheme 
makes reductions about three times as fast as the PM scheme. 
Of course, our estimates are very rough, but we believe it is clear from 
this that the two schemes are also roughly comparable as far as speed 
goes. 
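
The precedence-matrix totals can be re-derived from the per-statement estimates. The sketch below assumes our reading of the lists above: one time unit per statement, two units per PM look-up, and an average of 2.3 symbols per right part in PAL's grammar. The two CFST totals are simply copied from the text, since their per-group weights depend on statistics of PAL's CFSM that are not all restated here.

```python
AVG_RIGHT_PART = 2.3  # average number of symbols per right part (our reading of the text)

# PM read: one store, one two-unit matrix look-up, three tests, two stack updates.
pm_read = 1 + 2 + 1 + 1 + 1 + 1 + 1

# PM reduce: the scan for the left end of the handle costs 2 + 1 + 1 units per
# right-part symbol; Search costs 1 unit per store (A and OutSym) plus 2 units
# per right-part symbol.
scan = (2 + 1 + 1) * AVG_RIGHT_PART
search = 2 * 1 + 2 * AVG_RIGHT_PART
pm_reduce = 2 + 1 + 1 + 1 + scan + search + 1 + 1 + 1

# CFST totals as quoted in the text (not recomputed here).
cfst_read, cfst_reduce = 12.3, 8.0

read_ratio = cfst_read / pm_read        # CFST read time relative to PM read
reduce_ratio = pm_reduce / cfst_reduce  # PM reduce time relative to CFST reduce

print(pm_read, round(scan, 1), round(search, 1), round(pm_reduce, 1))
print(round(read_ratio, 1), round(reduce_ratio, 1))
```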

7. 6 Variations, Extensions 

There are, of course, many ways in which our scheme could be 
speeded up. We mention two here because they seem particularly 
appropriate. First, the read states with many transitions could be 
implemented as "transition matrix (TM)" look-ups; i.e., such that, if 
the next symbol is Symbol and the current "TMREAD" state is State, 
then TM(State, Symbol) would be the next state. This would substantially 
increase the average read speed at some storage cost. Since for PAL 
there are 18 read states with 10 or more transitions, the cost would be 
about 18*48*8 = 6912 bits = 216 words extra to implement those 18 as 
"TMREAD" states. Second, whether or not the first method is used, the 
ST and TT could be compiled rather than interpreted. In the nature of these 
things we might expect a factor of ten increase in speed for a factor of 
four increase in space, say. Since this would still leave us with a 
reasonable amount of space usage, it would represent a reasonable space- 
time trade-off for our purposes. The main point here is that our implement- 
ation method is flexible. 

Extensions. We next discuss the modifications to our implementation 
methods necessary to cover multiply inadequate states and k-symbol look-ahead. 
Our intent is merely to indicate the ease with which our method can be extended 
to cover these "exceptional cases." 

Multiply inadequate states. In general, the multiply inadequate 
state may have several read transitions and several look-ahead transitions. 
For example, we might have the following: 

[Diagram: an inadequate state with several look-ahead transitions, each labeled with a look-ahead set {...}.] 


One way to implement such a state would be first to implement it as we did 
above but with only one of the look-ahead transitions and then to store the 
extra look-ahead transitions in the TT immediately below the other transitions 
as follows. For each extra look-ahead transition add two entries to the TT: (1) the 
first having an irrelevant STATE component and a special symbol, *MORE*, 
in the SYM component, which has a representation distinct from all other 
items which can appear in the SYM component, and (2) the second having 
a regular (SYM, STATE) pair corresponding to the transition in question. 
For the above state the table entries would be as follows. 

STATE TABLE 

         TYPE     NUM    TTREF 
    M    (*)LA    3      (pointer into the TT) 

TRANSITION TABLE 

    SYM       STATE 
    r1        ... 
    *MORE*    (irrelevant) 
    r2        ... 
    *MORE*    (irrelevant) 
    r3        R 

    (r1, r2, and r3 are references to the LAT.) 




It should be clear that we can implement any multiply inadequate state in 
this way. Of course our interpreter will have to be modified to be ready 
for such states. The modification is trivial. It concerns only the bottom- 
most decision box in Figure 7. 2. The NO exit must be changed as indicated 
next. 



    LAT(TT(TTRef, SYM), LASymbol) = 1 ? 
        NO:  TT(TTRef+1, SYM) = *MORE* ? 
                 YES:  TTRef ← TTRef + 2, then repeat the LAT test 
                 NO:   Call ERROR 
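
The modified NO exit can be sketched in Python as follows. This is an illustration under stated assumptions rather than the interpreter itself: the TT is modeled as a list of (SYM, STATE) pairs, the LAT as a dictionary from a reference to a set of symbols (standing in for the bit matrix), and the names `MORE`, `select_lookahead`, and the state names S1, S2, S3 are hypothetical.

```python
MORE = object()  # stands for the special *MORE* symbol, distinct from every other SYM


def select_lookahead(tt, tt_ref, lat, la_symbol):
    """Return the TT index of the look-ahead transition whose set contains la_symbol."""
    while True:
        reference, _state = tt[tt_ref]
        if la_symbol in lat[reference]:  # LAT(TT(TTRef, SYM), LASymbol) = 1 ?
            return tt_ref
        # NO exit: is the next entry a *MORE* marker?
        if tt_ref + 1 < len(tt) and tt[tt_ref + 1][0] is MORE:
            tt_ref += 2                  # skip the marker and try the next transition
        else:
            raise SyntaxError("no look-ahead transition applies")


# Hypothetical tables for a state with three look-ahead transitions.
LAT = {"r1": {"a"}, "r2": {"b"}, "r3": {"c"}}
TT = [("r1", "S1"), (MORE, None), ("r2", "S2"), (MORE, None), ("r3", "S3")]

index = select_lookahead(TT, 0, LAT, "c")
print(TT[index][1])
```

For singly inadequate states the loop body runs once and the extra test is never reached on the YES exit, which matches the remark that "normal" states are not slowed down.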



Look-ahead for k > 1. To cover look-ahead of more than one 
symbol, we could add a new type of state, namely LAk for "look-ahead 
at the k-th symbol." This would require an additional exit point from the 
topmost decision box in Figure 7.2. We illustrate our proposed modification 
via example. Suppose we want to implement the following state. 

[Diagram: state M with two look-ahead transitions, one under the look-ahead set {abc, e} and one under the look-ahead set {abd, f}.] 

What we propose can be represented diagrammatically as follows. 

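[Diagram: M has look-ahead transitions under {e}, {f}, and {a}; the transition under {a} leads to Q, whose single transition tests for b as the second symbol ahead and leads to R; R has two transitions, which test for c and for d as the third symbol ahead.] 
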
We intend to imply by this diagram that M is a look-ahead state with 
three look-ahead transitions of the normal type, that Q has one look-ahead 
transition but it indicates a comparison with the second symbol ahead rather 
than the first, and that R has two look-ahead transitions which investigate 
the third symbol ahead. State M could be implemented as we have just 
discussed. States Q and R would be LAk states with tabular representations 
as follows. 

The corresponding section of the interpreter would be as follows, where 
we assume that, when Ql is called with argument LAn for some specific 
n = 2, 3, ..., it returns as value the n-th symbol in the queue, counting 
from the bottom, but it does not change the queue. 

    ST(State, TYPE) = ? 
        LAk: 
            LASymbol ← Ql(TT(TTRef−1, SYM)) 
            Last ← TTRef + ST(State, NUM) − 1 
            LASymbol = TT(TTRef, SYM) ? 
                YES:  (NEXT state) 
                NO:   TTRef ← TTRef + 1 
                      TTRef > Last ? 
                          NO:   repeat the comparison 
                          YES:  Call ERROR 




Note that if we use the variable Symbol rather than LASymbol above, the 
flowchart from the line beginning Last. . . down is exactly the same as the 
counterpart in the READ section. Thus, the modification requires only 
one new exit from the TYPE-test box, one extra statement, and a transfer 
into the READ portion of the interpreter. The only question remaining, then, 
is how do we deduce the new states and their interconnections? We believe 
the answer to this question is obvious so we do not treat it here. 
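
The LAk section of the interpreter can be sketched as below. The sketch assumes, as the flowchart suggests, that the TT entry just before a state's transitions holds the position n to inspect; the queue is modeled as a Python list whose first element is the next input symbol, and the helper names and the target states G and H are hypothetical.

```python
def peek(queue, n):
    """Return the n-th symbol ahead (n = 1 is the next symbol) without consuming it."""
    return queue[n - 1]


def lak_next_state(tt, tt_ref, num, queue):
    """Interpret one LAk state whose num transitions start at tt_ref."""
    n = tt[tt_ref - 1][0]         # the entry before the transitions holds n
    la_symbol = peek(queue, n)    # LASymbol <- Ql(LAn); the queue is unchanged
    last = tt_ref + num - 1
    while tt_ref <= last:         # same linear search as in the READ section
        sym, state = tt[tt_ref]
        if sym == la_symbol:
            return state
        tt_ref += 1
    raise SyntaxError("no transition under the inspected symbol")


# Hypothetical tables for states Q and R of the example:
# Q tests the second symbol ahead for b; R tests the third for c or d.
TT = [(2, None), ("b", "R"), (3, None), ("c", "G"), ("d", "H")]
Q_REF, Q_NUM = 1, 1
R_REF, R_NUM = 3, 2

queue = ["a", "b", "d"]
print(lak_next_state(TT, Q_REF, Q_NUM, queue), lak_next_state(TT, R_REF, R_NUM, queue))
```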

Thus, we have shown that "exceptions" like multiply inadequate states 
and k-symbol look-ahead states can be implemented with little change to our 
interpreter and with changes which do not affect the speed of the interpreter 
for "normal" states. We have, then, a very flexible method. 

Chapter 8 
CONCLUSIONS 

8. 1 Future Development 

Before our results will be ready for actual incorporation in a 
TWS several variations should be investigated as possible improvements. 
These variations concern computational methods, strategy, diagnostics, 
and translator- implementation methods. 

Computational Methods. We believe, for instance, that Knuth's 
parser-generating technique (Knu 65) could be adapted for use in the 
generation of CFSMs. Specifically, we believe that the set of all possible 
"state sets" generated by his algorithm for grammar G and k = 0 is 
isomorphic to the set of states of G's CFSM. Furthermore, if the latter 
is true, the "bit matrix" techniques of Lynch (Lyn 68) can probably be 
used for the very fast generation of CFSMs. We suspect that the resulting 
method would be faster than our piecemeal method in Section 7.1. 

Another possible area of improvement regards the computation of 
look-ahead sets and context pairs and the attendant strategy. We do not 
think this will be a critical issue for programming languages because we 
expect most of the related grammars to be either SLR(1) or very nearly 
SLR(1). However, for the sake of generality, exceptional cases, and 
the possible use of our TWS to build systems for more general "syntax- 
directed" computations, it would be reasonable to research further in this 
area. 

Strategy. Of course, the obvious thing to do in complex cases 
is to make the TWS interactive so the language designer, who presumably 
knows the grammar best, can assist in determining strategy. It may 
be reasonable, however, to provide more (or other) than the three 
methods for computing look-ahead sets that were provided above. 
Conveniently, our technique as a whole is amenable to other such methods; 
i.e., we do not care how look-ahead sets and state splitting are computed 
as long as they result in a correct parser. 

Computation of look-ahead, state-splitting. With regard to this 
area of possible improvement we briefly list three methods which should 
be investigated. 

(1) Especially if Knuth's algorithm is adapted for the generation of CFSMs, 
it should also be investigated for possible adaptation for computing simple 
1-look-ahead sets. This would require the separation in his technique of 
the computations of look-ahead sets and state-splitting, which we believe 
to be easy to do. Actually, we believe the resulting technique would cover 
slightly more than the SLR(1) grammars, perhaps with little or no more 
complexity (computation time) than our SLR(1) technique. We do not 
believe, however, that the technique would be nearly as fast as our 
SLR(k) technique for k > 1 because we see no simple way of using it to 
compute look-ahead for a single state. 

(2) Lynch (Lyn 68) has a fast technique for computing left and right 
context in which each symbol is computed independently of all others. 
This technique should be investigated as a possible prelude to or 
replacement for the computation of corresponding context pairs. 

(3) Finally, for the general LR(k) case the look-ahead sets might be 
most easily computed by simulation of the parser in a nondeterministic 
manner; i.e., for each left context φ see if φβγ is a canonical form, where 
βγ ∈ V* and |β| ≤ k, by determining if there is any sequence of actions 
by the stack algorithm which will cause φβ to be read. Of course it must 
be proven that this method would result in the appropriate look-ahead sets. 

Diagnostics. A related area which needs investigation concerns 
diagnostic messages to the language designer. What information would 
be useful to the designer when his grammar is found not to be SLR(k) or 
LR(k)? Presumably, in such cases the designer has inadvertently submitted 
an ambiguous grammar, since we expect all of his unambiguous grammars to 
be SLR(k) or at least LR(k). The diagnostics should, of course, lead the 
designer to find the reason why the grammar is ambiguous. 

Implementation methods. Finally, there are several possible ways 
in which our translator implementation could be improved. First, a way 
of implementing in a reasonable amount of space states which jump over 
long cascades of look-ahead and look-back states is desirable. We suspect 
that these can be implemented by using bit matrices in a manner similar to 
precedence techniques. Second, similar bit-matrix techniques may also 
be useful for speeding up read states with many transitions, rather than 
using transition matrices. Third, we believe that the obvious way (for 
most applications) to implement the state and transition tables which 
remain after the above modifications is to compile them into machine code. 

8. 2 Conclusions 



We believe that we have demonstrated the validity of our thesis. 
We have a practical translator-constructing technique which grows in 
complexity as it discovers the complexity of the grammar at hand, and it 
generates practical translators for SSTGs which are based on LR(k) 
grammars and which partially specify useful, readable programming 
languages. Thus we have a basis for a TWS in which the key feature is 
flexibility. 

First and foremost, we have given the language designer flexibility 
in the design of his grammar. From the beginning it has been our desire 
to get a method which would accept a CF grammar as it was designed as 
a syntactical reference for a language, with no modifications. That is, 
we wanted a method that would accept a "humanized" version of the syntax. 
To the extent that unambiguity is considered a desirable trait of such a 
reference, we believe we have such a method. This belief is founded on 
the intuitive grounds that, when a designer sets out to define part of the 
syntax of a programming language via a CF grammar, he will just naturally 
come up with an LR(k) grammar, and in fact, probably an SLR(1) grammar. 

Second, we have given the implementor of the TWS flexibility. 
He has the flexibility to build in whatever strategy is appropriate for the 
purposes at hand for deciding whether grammars are LR(k), and he can 
leave some of that strategy to be decided by the language designer. He 
also has the flexibility not only to implement each translator as a whole 
in a variety of ways, but also to implement particular states in special 
ways. In fact, see (DeR 68) for a proposal concerning the use of different 
kinds of parsing techniques on different parts of a grammar. 

8. 3 Future Research, Extensions 

The area of future research most important to our results is that 
of language design and specification itself. The value of our results is 
somewhat limited until there is developed a useful, unified methodology 
for specifying programming languages fully, which incorporates something 
similar to SSTGs and/or production-node pairs. We have proceeded on the 
assumption that such a methodology is forthcoming, and we have faith 
that one is (see for example (Knu 66) and (Tho 69)). 

A more specific design problem, which is part of the above area 
and important to our results, is the one discussed in detail in Section 6.8. 
Once the designer has in mind a set of abstract syntax trees, operator 
precedences and associativities, scopes of variables, etc., how can he 
algorithmically generate an appropriate CF grammar which has the 
corresponding "structural properties" and which is guaranteed to be 
LR(k), or even better SLR(1)? Currently, the generation of such grammars 
is definitely an art, being performed on the basis of past experience and 
trial-and-error methods. 

Another related problem is that of extending the usefulness of 
CF grammars, and therefore, BNF. There are three ways in particular 
in which we would like to see their powers extended. It goes without saying 
that we would also like to see our techniques extended to cover these 
extensions. 

(1) We often like to indicate via regular expressions in right parts 
of productions that certain operators are nonassociative. For instance, for 
PAL the following production-node pair specifies the correspondence between 
an abstract-syntax-tree node and strings involving the nonassociative 
(syntactic) operator "and": 

    (p)    DA ::= DR { and DR }* 

[Diagram: the abstract-syntax-tree node paired with production p, with subtrees labeled DR.] 


There seems to be no natural way of indicating this correspondence using 
pure BNF. Can we construct a CFSM from a grammar including the above 
production in a manner similar to that given in Section 7.1? Presumably, 
the piece of FSM corresponding to that production is as follows: 

[Diagram: FSM fragment for production p, with transitions labeled DA, DR, and "and", and a #p-transition.] 


But what should be the "reduction procedure" executed by the corresponding 
DPDA for the #p-transition? How should that procedure interact with the 
ASTB? 

(2) One can often indicate a reduction in the need for parentheses in certain 
special contexts via special context-sensitive productions. For instance, 
to use a trivial example, the meaning of the following subexpression seems 
clear: 

    ...(1 + if B then 2 else 3)... 

And yet the ALGOL 60 syntax disallows it, requiring the programmer to 
write: 

    ...(1 + (if B then 2 else 3))... 

Often the set of legal programs can be extended to include subexpressions 
such as the former one above by adding either a large number of CF pro- 
ductions or only one or two context-sensitive productions to the grammar. 
If in the latter case the resulting language is CF, our results should, in 
theory, still be applicable. Can we get sufficient conditions on the 
allowable contexts such that we still have a CF language? Can we modify 
our DPDA in a simple way so that it checks these contexts at the appropriate 
times and therefore recognizes the intended language? 

(3) Finally, since we are really interested in translations rather than 
parses, can we change our notions of unambiguity to correspond to the 
former rather than the latter in such a way that we can extend our techniques 
to cover all "unambiguous" SSTG's? See (Eva 65) for some results in this 
area. 

More in regard to compilers, we note that it is probably true that 
if we retain transitions under nonterminals, we would have an "incremental 
compiler"; i.e., one which would accept a string which is already partially 
parsed. (A proof is needed.) If these transitions were stored in some 
special place, rather than directly in the CFST, and if the reads and 
look-aheads concerning nonterminals were treated as special cases, our 
compiling speed for terminal strings would not be reduced. Perhaps a 
compiler could be constructed using this technique which would have good 
recompilation characteristics, and therefore, good overall "efficiency". 

Finally, our automata-theoretic tendencies lead us to ask if we are 
on the verge of a result regarding the minimality of DPDAs, at least with 
respect to parsers for CF grammars. Our DPDA-parsers are based on 
CFSMs which are reduced, and therefore minimal. Is the DPDA which we 
get by starting with a minimal FSM in some meaningful sense a minimal 
version of any other DPDA which effects the same parsings? We know 
of no existing results in this area. 




APPENDIX 
WEAK PRECEDENCE GRAMMARS 

A CF grammar is called ε-free if and only if it has no productions 
with empty right parts. 

An ε-free CF grammar is called weak precedence (HbM 69) if 
and only if (1) no two productions have identical right parts, (2) at most 
one of the following relations holds between any two of the symbols in V: 

    (a) X_1 < X_2 if A → σ_1 X_1 X_2 σ_2 is a production, or A → σ_1 X_1 A_2 σ_2 
        is a production and A_2 →* X_2 σ_3; 

    (b) X_1 > X_2 if A → σ_1 A_1 X_2 σ_2 (or A → σ_1 A_1 A_2 σ_2) is a production 
        and A_1 →* σ_3 X_1 (and A_2 →* X_2 σ_4); 

and (3) neither of these relations holds between X_1 and A_2 if there exist 
productions A_1 → σ_1 X_1 X_2 σ_2 and A_2 → X_2 σ_2. 

The sequence of theorems below proves that any weak precedence 
grammar is SLR(1). The converse is not true: grammar G (page 29) is 
LR(0) (and therefore SLR(1)), as was shown in Chapter 3, but it is not weak 
precedence since productions 3 and 6 have identical right parts. 
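
Condition (1) of the definition is easy to test mechanically. The following is a minimal sketch, assuming a grammar is given as a list of (left part, right part) pairs numbered from 1; the function name and the small example grammar are hypothetical and are not the grammar G of page 29.

```python
from collections import defaultdict


def identical_right_parts(productions):
    """Return groups of production numbers that share the same right part."""
    groups = defaultdict(list)
    for number, (_left, right) in enumerate(productions, start=1):
        groups[tuple(right)].append(number)
    return [numbers for numbers in groups.values() if len(numbers) > 1]


# Hypothetical grammar in which productions 4 and 5 have identical right parts,
# so condition (1) fails and the grammar cannot be weak precedence.
PRODUCTIONS = [
    ("S", ["E"]),
    ("E", ["E", "+", "T"]),
    ("E", ["T"]),
    ("T", ["a"]),
    ("F", ["a"]),
]

print(identical_right_parts(PRODUCTIONS))
```

An empty result means only that condition (1) holds; conditions (2) and (3) on the relations < and > would still have to be checked separately.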

Lemma A.1. Let G be a CF grammar and N be a state of G's 
CFSM having a #_p-transition, whose production p is A → ω. 
Then any string φ which accesses N must end in 
ω; i.e., φ = ρω for some ρ ∈ V*. 

Proof: The CFSM accepts the characteristic string 
φ#_p, therefore there exists a canonical form φβ = ρωβ 
which reduces to ρAβ, by definition of characteristic 
strings. Q.E.D. 

Theorem A.2. When the CFSM of a weak precedence 
grammar G_wp enters an inadequate state, the last 
symbol of the left context is implicitly known. 

Proof: The fact that G_wp is ε-free in conjunction with 
Lemma A.1 proves this. Q.E.D. 

Lemma A.3. Let G be a CF grammar with characteristic 
string ρσ_1X_1X_2γ#_p. Then X_1 < X_2. 

Proof: By definition of characteristic strings 
ρσ_1X_1X_2γβ is a canonical form, for some β in V*. Thus, 
either 

    S →* ρAβ → ρσ_1X_1X_2σ_2β →* ρσ_1X_1X_2γβ 

where σ_2 →* γ, or 

    S →* ρAβ → ρσ_1X_1A_2σ_2β →* ρσ_1X_1X_2σ_3σ_4β = ρσ_1X_1X_2γβ 

where A_2 →* X_2σ_3, σ_2 →* σ_4 ∈ V_T*, and γ = σ_3σ_4. 
These are the only two possibilities and each implies that 
X_1 < X_2. Q.E.D. 


Theorem A.4. The CFSM of a weak precedence grammar 
has no multiply inadequate states. 

Proof: Lemma A.1 implies that any φ which accesses a 
state N with transitions under distinct #_p and #_q must 
end in both ω_p and ω_q, where productions p and q are 
A_p → ω_p and A_q → ω_q, respectively. But we cannot have 
ω_p = ω_q for distinct p and q, because that would violate 
condition (1) in the definition of weak precedence. 
Furthermore, if |ω_p| > |ω_q| then we have ω_p = σ_1X_1X_2σ_2 
and ω_q = X_2σ_2. But this implies that φ#_q = ρσ_1X_1X_2σ_2#_q 
is a characteristic string (by Lemma A.1), and therefore, 
that ρσ_1X_1A_qβ is a canonical form, for some β in V*, 
whose characteristic string is ρσ_1X_1A_qθ#_r for some 
θ in V* and some production r. Thus, Lemma A.3 implies 
that X_1 < A_q. But that violates condition (3) of the definition, 
so no such state N can exist. Q.E.D. 



Theorem A.5. Let G_wp be a weak precedence grammar. 
Then G_wp is SLR(1). 

Proof: Because of Theorem A.4 and the SLR(k) definition 
(page 73) we need only prove, for any inadequate state N of 
G_wp's CFSM with (among others) transitions under some 
terminal t and #_p for some production p which is A → ω, 
that t is not in the set F_T^1(A). Consider the following. 

    F_T^1(A) = {(1:β) ∈ V* | S →* ρAβ} 

             = {X_2 ∈ V_T | S →* ρAβ → ρσ_1A_1X_2σ_2β or 
                   ρAβ → ρσ_1A_1A_2σ_2β and A_2 →* X_2σ_4 or 
                   ρAβ → ρσ_1A_1X_2σ_2β and A_1 →* σ_3A or 
                   ρAβ → ρσ_1A_1A_2σ_2β and A_1 →* σ_3A 
                       and A_2 →* X_2σ_4} 

Thus, the relation > holds between the last symbol of the 
left context implicit when the CFSM is in N (i.e., ω:1 by 
Lemma A.1 and Theorem A.2) and every symbol in F_T^1(A). 
But from Lemma A.1 all of the characteristic strings which 
correspond to the t-transition are of the form φtθ# = ρωtθ#, 
so Lemma A.3 implies that (ω:1) < t. Since condition (2) 
of the definition of weak precedence states that both the 
relations < and > cannot hold between (ω:1) and t, we see 
that t is not in F_T^1(A). Q.E.D. 